
Page 1

Qualifying exam presentation

Badri Adhikari, Student ID: 14155100
Department of Computer Science, University of Missouri, Columbia, MO 65211, USA

10/13/2014

1. Marks, Debora S., Thomas A. Hopf, and Chris Sander. "Protein structure prediction from sequence variation." Nature Biotechnology 30.11 (2012): 1072-1080.

2. My research projects.

3. Wu, Sitao, and Yang Zhang. "A comprehensive assessment of sequence-based and template-based methods for protein contact prediction." Bioinformatics 24.7 (2008): 924-931.

4. Hinton, Geoffrey, Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief nets." Neural Computation 18.7 (2006): 1527-1554.

Page 2

"Protein structure prediction from sequence variation."

Page 3

Abstract

• Evolutionary information about functional constraints in genomic sequences can be mined to detect evolutionary couplings between residues in proteins.

• Evolutionary couplings -> protein 3D structures

• Improved understanding of covariation may help identify functional residues involved in ligand binding, protein-complex formation and conformational changes.

• Computation of covariation patterns can complement experimental structural biology in elucidating the full spectrum of protein structures, their functional interactions and evolutionary dynamics.

Page 4

Covariation

Page 5

Covariation

• Several groups have demonstrated that extracting covariation information from sequences is sufficient to:

• Estimate which pairs of residues are close in three-dimensional space

• Fold a protein to reasonable accuracy

• These pairs of covarying residues should also be predictive of functional sites, protein interactions, and alternative conformations.

• To find true evolutionary covariation between residues, one must minimize the effect of transitive correlations.

Page 6

The problem of transitive correlations

• Transitive correlations are the false positive correlations observed. Example: two residues that contact the same third residue but do not actually contact each other.

• If residues A and B contact each other, as do B and C, then in general a transitive influence is observed between residues A and C (a chaining effect).

• Local statistical methods assume that pairs of residue positions are statistically independent of other pairs of residues, and so cannot remove these effects.

Page 7

Transitive correlations removed by global statistical approaches

• Steps for "entropy maximization under data constraints":

1. Create a multiple sequence alignment (MSA) between many members of an evolutionarily related protein family.

2. Calculate the covariance matrix (observed minus expected pair counts) of dimension 20L x 20L, by counting how often a given pair of the 20 amino acids, say alanine and lysine, occurs at a particular pair of positions, say positions 15 and 67, in any one sequence, summing over all sequences in the MSA.

3. To compute a measure of causative correlations, the conditional mutual information, take the inverse of the covariance matrix. This is the numerical estimate of direct pair interactions.
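To make the three steps concrete, here is a minimal sketch of the covariance-inversion idea on a toy alignment. It omits sequence reweighting and uses a simple ridge term so the matrix is invertible; the block-norm scoring is an illustrative convention, not the paper's exact procedure.

```python
# Hedged sketch: MSA -> covariance matrix -> inverse -> coupling scores.
import numpy as np

def coupling_scores(msa, q=20):
    """msa: (n_sequences, L) array of amino-acid indices in 0..q-1."""
    n, L = msa.shape
    onehot = np.eye(q)[msa].reshape(n, L * q)   # one-hot encode each column
    f1 = onehot.mean(axis=0)                    # single-site frequencies
    f2 = onehot.T @ onehot / n                  # pair frequencies
    C = f2 - np.outer(f1, f1)                   # observed minus expected pair counts
    C += 0.1 * np.eye(L * q)                    # ridge term (assumption) for invertibility
    J = -np.linalg.inv(C)                       # inverse covariance ~ direct couplings
    scores = np.zeros((L, L))                   # summarize each 20x20 block by its norm
    for i in range(L):
        for j in range(i + 1, L):
            block = J[i*q:(i+1)*q, j*q:(j+1)*q]
            scores[i, j] = scores[j, i] = np.linalg.norm(block)
    return scores
```

High-scoring position pairs are the candidate evolutionary couplings; real methods add sequence reweighting and stronger regularization before this inversion.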

Page 8

Transitive correlations removed by global statistical approaches

Page 9

Transitive correlations removed by global statistical approaches

• Using a similar approach, Lapedes and Jarzynski reached a first breakthrough in contact prediction in 2002, for 11 small proteins, but they did not compute three-dimensional structures.

• After removal of transitive correlations and other confounding effects, predicted contacts based on the global probability models provide a basis for the computation of three-dimensional folds.

Page 10

From contact predictions to protein folding

• To what extent does improved contact prediction lead to improved de novo prediction of 3D structures?

• A folding protocol, EVfold, was developed.

• Predicted residue contacts from coevolution patterns are translated into detailed atomic coordinates by placing distance restraints on an extended polypeptide.

Page 11

EVfold

• A 3D structure is calculated by constraining the distances between pairs of residues with high covariance scores, using a standard distance geometry algorithm of the kind used to solve 3D structures from experimental constraints derived from NMR spectroscopy data.

• This is followed by simulated annealing with molecular dynamics to ensure correct bond lengths and side-chain conformations.
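As a rough illustration of this step, the sketch below converts top-ranked coupling scores into NMR-style distance restraints; the bounds, the minimum sequence separation, and the (i, j, lower, upper) tuple format are assumptions for illustration, not EVfold's actual restraint format.

```python
# Hedged sketch: predicted contacts -> Calpha-Calpha distance restraints.
def contacts_to_restraints(scores, top_n, min_separation=5,
                           lower=2.0, upper=8.0):
    """scores: (L, L) symmetric coupling-score array; returns restraint tuples."""
    L = scores.shape[0]
    pairs = [(scores[i, j], i, j)
             for i in range(L) for j in range(i + min_separation, L)]
    pairs.sort(reverse=True)                     # strongest couplings first
    # Restrain each selected pair's Calpha-Calpha distance to the contact range.
    return [(i, j, lower, upper) for _, i, j in pairs[:top_n]]
```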

Page 12
Page 13
Page 14

EVfold evaluation

• All-atom 3D structure coordinates were predicted from sequence alone for 15 diverse globular folds of up to 220 amino acids.

• Overall accuracy reached 2.8-5.1 Å Cα r.m.s. deviation relative to experimentally determined structures.

• The accuracy of the atomic coordinates was reported to be best around active sites (down to around 1 Å all-atom deviation over 5-10 residues).

Page 15

Quality of predicted folds is likely to improve over time

• More sequence information tends to lead to higher accuracy of the distance constraints.

• The currently limited atomic accuracy is likely to improve with advanced molecular dynamics refinement methods (for example, CNS, Rosetta, etc.).

Page 16

Structure prediction of membrane proteins

• The structures of membrane proteins are notoriously difficult to determine by crystallography or NMR spectroscopy.

• Another group has tested the ability to predict 3D structures on 25 membrane proteins with up to 487 residues.

• Two notable classes of proteins in the data set are G protein-coupled receptors and membrane transporters. The accuracy ranges from 2.6 Å to 4.8 Å, which is notable.

Page 17

EVfold uses the least structural information

• Other global statistical modeling approaches use fragment-based prediction methods.

• The Jones group's FILM3 predicted the structures of 32 known membrane proteins.

• The Onuchic group's DCAfold predicted structures of 15 bacterial protein domains of up to 133 residues; this is comparable to EVfold.

• EVfold uses the least existing structural information and therefore shows the potential for prediction of unknown folds.

Page 18

Applications of improved structure-prediction methods

Page 19

Limitations

• Many of the predicted contacts involved in protein features (functional sites, homomultimer contacts, alternative conformations) may appear as false positives in the prediction of intradomain residue contacts.

• A challenge for the field will be to develop algorithms that can disambiguate the different functional constraints.

• Detection of evolutionary couplings between residues requires a substantially diverse set of sequences, which is not yet available for many families.

• EVfold needs about 5L sequences in the multiple alignment.

• This may be addressed over time as sequence databases grow.

Page 20

Combining experimental and computational structural biology

• Protein-structure determination by NMR spectroscopy is ideally suited for a hybrid approach, as it is based on the determination of distance constraints.

• Combining reduced X-ray and NMR spectroscopy data sets with predicted three-dimensional models may open a new phase for structural biology, with much more rapid determination of high-accuracy protein structures.

• Using massive sequence data sets, successful decoding of the molecular record of evolutionary constraints could now reveal structural and functional information about proteins at an unprecedented rate.

Page 21

1,250 alpha-helical transmembrane protein families known in mid-2012

Page 22

My Research

“Contact assisted protein structure modeling.”

Page 23

My projects

• Reconstruction of “hard” protein targets using residue contacts. Not much luck.

• Building 3D models using SVMcon-, NNcon-, and DNcon-predicted contacts. Not much luck.

• Fragment assembly based model building. Not much luck.

• Combining various types of contacts. Not much luck.

• Contact filtering to improve model building. Not much luck.

• Beta sheet construction using DGSA protocol. Good news here.

• Folding proteins that can be folded with contacts. Working on a publication.

Page 24
Page 25

Contact-assisted fragment replacement based protein structure prediction

This slide's flowchart describes a Monte Carlo simulated annealing loop that starts from an extended structure and, on each iteration:

1. Randomly picks a fragment (length 9).
2. Predicts features (secondary structure and solvent accessibility) for this fragment.
3. Looks up the fragment database to find the best-matching fragment.
4. Converts the structure into angular space and replaces the fragment.
5. Converts the structure back to Cartesian coordinates.
6. Computes the energy of the new structure (using a contact satisfaction score).

The loop terminates with the final structure.
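A minimal sketch of this loop follows. The fragment move (steps 1-5) is abstracted into a caller-supplied propose_fragment_move function, and the energy is simplified to counting unsatisfied predicted contacts; the names and cooling schedule are illustrative assumptions, not the pipeline's actual settings.

```python
# Hedged sketch of contact-assisted Monte Carlo simulated annealing.
import math, random

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contact_energy(coords, contacts, cutoff=8.0):
    """Energy = number of predicted contacts not satisfied (lower is better)."""
    return sum(1 for i, j in contacts if distance(coords[i], coords[j]) > cutoff)

def mc_sa_fold(coords, contacts, propose_fragment_move,
               n_steps=10000, t_start=2.0, t_end=0.1):
    """propose_fragment_move(coords) -> new coordinates with one 9-residue
    fragment replaced (steps 1-5 above); assumed to be supplied by the caller."""
    energy = contact_energy(coords, contacts)
    for step in range(n_steps):
        t = t_start * (t_end / t_start) ** (step / n_steps)  # geometric cooling
        candidate = propose_fragment_move(coords)
        e_new = contact_energy(candidate, contacts)
        # Metropolis criterion: always accept improvements, sometimes accept worse.
        if e_new <= energy or random.random() < math.exp((energy - e_new) / t):
            coords, energy = candidate, e_new
    return coords
```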

Page 26

Protein folding process for target T0716, demonstrated through a movie: http://www.youtube.com/watch?v=HBONCqN9U4k

Page 27

Current pipeline

Page 28

My pipeline versus EVfold

Page 29

Example: RASH_HUMAN

Page 30

"A comprehensive assessment of sequence-based and template-based methods for

protein contact prediction."

Page 31

Abstract

Method     | Easy / Medium targets | Hard targets
-----------|-----------------------|-------------
SVM-LOMETS | Outperforms SVM-SEQ   |
SVM-SEQ    |                       | Outperforms SVM-LOMETS by 12-25%
Combined   |                       | Contact prediction accuracy improves by 60%

SVM-LOMETS collects consensus contact predictions from multiple threading templates.

SVM-SEQ is a sequence-based machine learning approach trained on a variety of sequence-derived features.

Page 32

Dataset and contact definition

• 500 non-homologous proteins from PDBSELECT, with pair-wise sequence identity < 25% and sizes ranging from 50 to 559 residues.

• Proteins with broken chains, missing entities, or format errors were removed.

• 22k/27k/28k contacts and 87k/107k/112k non-contacts in the short/medium/long ranges, respectively.

• A pair of residues is in contact if their Cα atom distance is < 8 Å.

• Sequence separation: 6-11 -> short range, 12-24 -> medium range, and > 24 -> long range.

• Acc = N_correct / N_predicted

• Pct = N_predicted / L
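A short sketch of these definitions, assuming ca is an (L, 3) numpy array of Cα coordinates and predicted is a list of (i, j) pairs; the names are illustrative.

```python
# Hedged sketch of the contact definition and the Acc/Pct measures.
import numpy as np

def contact_map(ca, cutoff=8.0):
    """True where the Calpha-Calpha distance is below the 8 A cutoff."""
    d = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)
    return d < cutoff

def separation_range(i, j):
    s = abs(i - j)
    if 6 <= s <= 11:
        return "short"
    if 12 <= s <= 24:
        return "medium"
    return "long" if s > 24 else None

def acc_and_pct(predicted, true_map, L):
    correct = sum(bool(true_map[i, j]) for i, j in predicted)
    return correct / len(predicted), len(predicted) / L  # Acc, Pct
```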

Page 33

SVM-SEQ

• Residue pairs in the training structures are categorized as 'contacted' or 'non-contacted'.

• In principle, the training data should cover as many residue pairs as possible, but including too many pairs requires a long training CPU time.

• The number of non-contacted pairs is much larger than that of contacted pairs (> 20:1).

• By trial and error, the ratio of non-contacted to contacted residue pairs is kept at 4:1 by randomly selecting residue pairs.

Page 34

SVM-SEQ Local window features

1. Position-specific scoring matrices (PSSM)
• Generated by a PSI-BLAST search of the query against a non-redundant sequence database.

2. Secondary structure predictions
• Predicted by PSIPRED.
• 3 states: alpha-helix -> [0 1 0], beta-strand -> [0 0 1], and coil -> [1 0 0].

3. Solvent accessibility predictions
• Predicted by neural network training.
• Buried -> [0 1], and exposed -> [1 0].

• With a 15-residue window, the total number of local features for a pair is 750 (= 2 * 15 * (20 + 3 + 2)), as sketched below.
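The 750-value arithmetic can be made concrete with a short sketch; the array names (pssm, ss, sa) and zero-padding at the chain ends are assumptions for illustration.

```python
# Hedged sketch: 15-residue windows of PSSM (20) + SS (3) + SA (2) features
# around residues i and j give 2 * 15 * 25 = 750 values per pair.
import numpy as np

def pair_window_features(pssm, ss, sa, i, j, w=15):
    """pssm: (L, 20); ss: (L, 3) one-hot; sa: (L, 2) one-hot."""
    per_res = np.hstack([pssm, ss, sa])               # (L, 25) per-residue features
    half = w // 2
    padded = np.pad(per_res, ((half, half), (0, 0)))  # zero-pad chain ends
    win_i = padded[i:i + w].ravel()                   # 15 * 25 = 375 values
    win_j = padded[j:j + w].ravel()
    return np.concatenate([win_i, win_j])             # 750 values total
```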

Page 35

SVM-SEQ In-between segment feature sets

1. The number of residues between i and j, i.e. |i - j|.

2. The compositional percentage of the three secondary structure elements and the two burial states among the in-between residues.

3. State distributions of the in-between residues, specified by the four moments F_n = ⟨(k − ⟨k⟩)^n⟩, n = 1, 2, 3, 4, where k (= m − i) is the position of the m-th residue relative to i along the chain; each moment is calculated for five specific states: helix, strand, coil, buried, and exposed (see the sketch after this list).

4. The local features of five selected in-between residues that are evenly distributed between i and j.

• The SVM software developed by Joachims (2002) is used to classify the contacted and non-contacted residue pairs.

• The accuracy of a neural network was found to be 30% lower.
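Here is a sketch of the moment feature in item 3. Note that the first central moment is zero by definition, so the n = 1 case presumably refers to the raw mean ⟨k⟩; the state_mask boolean array is an assumed input.

```python
# Hedged sketch of F_n = <(k - <k>)^n> for the in-between residues in one state.
import numpy as np

def state_moments(i, j, state_mask):
    """state_mask[m] is True if residue m is in the given state (e.g. helix)."""
    ks = np.array([m - i for m in range(i + 1, j) if state_mask[m]], dtype=float)
    if ks.size == 0:
        return np.zeros(4)
    centered = ks - ks.mean()
    # n = 1..4 central moments; a real implementation may use <k> itself for n = 1.
    return np.array([np.mean(centered ** n) for n in (1, 2, 3, 4)])
```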

Page 36

LOMETS

• LOMETS is a local meta-threading server that includes nine locally installed threading programs.

• For each target, LOMETS first threads the sequence through the PDB library to identify possible templates.

• The threading programs of LOMETS represent a diverse set of state-of-the-art algorithms using different approaches: sequence profile alignments, structural profile alignments, pair-wise potentials, and hidden Markov models.

• A consensus combination of the meta-server algorithms significantly outperforms the individual threading methods.

Page 37

SVM-LOMETS

• One defect of the LOMETS prediction is the coarse-grained distance cutoff (e.g. distances of 7.9 Å and 8.1 Å result in different contact/non-contact assignments despite the tiny difference).

• In SVM-LOMETS, an SVM algorithm is used to train on the distance cutoff parameters and alignment qualities of the contact map.

• For each pair of residues (i and j), the training features are prepared.

Page 38

SVM-LOMETS Features

1. The frequency of the contact occurring in the top N templates (N = 10, 20, ..., 90).

2. The average and standard deviation of the Cα distance (d_ij) calculated from the templates that have d_ij < 12 Å.

3. The number of continuously aligned residues within a 5-residue window.

4. The burial depth of the residues: the distance from the Cα atoms of i and j to the centroid of the template structure, divided by the radius of gyration (sketched below).

5. The average of the normalized Z-scores.

6. The predicted TM-score of the templates.

These feature sets feed into separate SVM trainings.
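As an illustration of feature 4, the sketch below computes the burial depth from one template's Cα coordinates; the variable names are assumptions.

```python
# Hedged sketch: burial depth = |Calpha_i - centroid| / radius of gyration.
import numpy as np

def burial_depth(ca, i):
    """ca: (L, 3) Calpha coordinates of one threading template."""
    centroid = ca.mean(axis=0)
    rg = np.sqrt(np.mean(np.sum((ca - centroid) ** 2, axis=1)))  # radius of gyration
    return np.linalg.norm(ca[i] - centroid) / rg
```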

Page 39

SVM-LOMETS training

• Nine different training datasets with 10, 20, …, 90 top templates.

• Targets are split into 3 categories (Easy, Medium, and Hard) based on program-specific Z-scores.

• 27 SVM classifiers are trained.

• The number of templates with Z-score > 0.55 * Z0, compared with N, is used to determine which SVM classifier is finally used to generate the contact predictions.

Page 40

Results: SVM-SEQ versus other machine-learning methods

• SVM-SEQ shows some modest advantage (if any) in comparison with current machine learning methods.

Page 41

Results: SVM-LOMETS versus LOMETS

• An improvement of 5.2%.

Page 42

Results: Template-based versus sequence-based methods

Page 43

Results: Template-based versus sequence-based methods

Page 44

Results: dependence of SVM-SEQ on target categories

• Unexpectedly, the results show a dependence of SVM-SEQ on the target categories (Easy, Medium, Hard), even though SVM-SEQ does not exploit template information.

• The larger number of homologous sequences for easier targets helps construct a better PSSM, on which SVM-SEQ has mainly been trained.

• This explains the different performance of SVM-SEQ in the different categories.

Page 45

Results: combining SVM-LOMETS and SVM-SEQ

• Count the portion of contacts correctly predicted by SVM-SEQ that are not predicted by SVM-LOMETS.

• For hard/very hard targets, a combination of SVM-SEQ with SVM-LOMETS can enlarge the total number of correct contact predictions by 62% compared with using SVM-LOMETS alone.

Page 46

Results: New fold targets in CASP7

• Examine the sequence-based and template-based methods on the 15 new fold (NF) targets as categorized in the CASP7 experiment.

• There is no similar structure solved in the PDB library for these targets.

• Compare with SAM-T06-server, the best server predictor in the CASP7 experiment.

Page 47

Results: New fold targets in CASP7

Page 48

Conclusions

• The accuracy of SVM-SEQ is comparable to the top published sequence-based machine learning methods.

• SVM-LOMETS (trained additionally using contact frequencies, Cα distances, and template qualities) generates slightly better contact predictions (by 5.2%) than the original LOMETS method.

• The overall accuracy of template-based contact prediction is much higher than that of sequence-based contact prediction.

• For Hard targets, SVM-SEQ generates contact predictions with an accuracy comparable to or better than the template-based predictions.

• Incorporating the SVM-SEQ contact predictions in the I-TASSER simulation results in about a 5% TM-score increase for the first models of the Hard targets.

• For the new fold targets in CASP7, the accuracy of threading template-based contact prediction is close to random, while SVM-SEQ generates contact predictions with about 20% of them being correct.

Page 49

"A fast learning algorithm for

deep belief nets."

Page 50

Abstract

• Use of “complementary priors” to eliminate explaining away effects that make inference difficult in densely connected belief nets that have many hidden layers.

• Derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time.

• After fine-tuning, a network of three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels.

• The generative model gives better digit classification than the best discriminative learning algorithms.

Page 51

Belief Networks

Source: http://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.pdf

Page 52

Stochastic binary units

Source: http://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.pdf

Page 53

Restricted Boltzmann Machines

Source: http://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.pdf

Page 54

Contrastive Divergence Algorithm

Source: http://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.pdf

Source: http://cl.naist.jp/~kevinduh/a/deep2014/140116-ResearchSeminar.pdf

Page 55

Gibbs sampling

• Gibbs sampling is a Markov chain Monte Carlo algorithm for obtaining a sequence of observations approximated from a specified multivariate probability distribution (i.e., from the joint probability distribution of two or more random variables) when direct sampling is difficult.
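For intuition, here is a minimal Gibbs sampler for a standard bivariate normal with correlation rho, where each conditional is N(rho * other, 1 - rho^2); this toy example is illustrative and not from the paper.

```python
# Hedged sketch: alternately sample each variable from its conditional.
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500):
    x = y = 0.0
    sd = (1.0 - rho ** 2) ** 0.5          # conditional standard deviation
    samples = []
    for t in range(n_samples + burn_in):
        x = random.gauss(rho * y, sd)     # x | y ~ N(rho * y, 1 - rho^2)
        y = random.gauss(rho * x, sd)     # y | x ~ N(rho * x, 1 - rho^2)
        if t >= burn_in:                  # discard pre-convergence samples
            samples.append((x, y))
    return samples
```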

Page 56

Definition of General Complementarity

• Consider a joint distribution over observables x and hidden variables y.

• For a given likelihood function, P(x|y), we define the corresponding family of complementary priors to be those distributions, P(y), for which the joint distribution P(x, y) = P(x|y)P(y) leads to posteriors, P(y|x), that exactly factorize, that is, leads to a posterior that can be expressed as P(y|x) = ∏_j P(y_j|x).

Page 57

Logistic belief net

• A logistic belief net is composed of stochastic binary units.

• When the net is used to generate data, the probability of turning on unit i is a logistic function of the states of its immediate ancestors, j, and of the weights, w_ij, on the directed connections from the ancestors:

p(s_i = 1) = 1 / (1 + exp(-b_i - Σ_j s_j w_ij))

where b_i is the bias of unit i.
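A one-function sketch of this generative rule; the list-based representation of ancestor states and weights is an illustrative assumption.

```python
# Hedged sketch: sample a stochastic binary unit from its logistic probability.
import math, random

def sample_unit(ancestor_states, weights, bias):
    """p(s_i = 1) = 1 / (1 + exp(-bias - sum_j s_j * w_ij))."""
    activation = bias + sum(s * w for s, w in zip(ancestor_states, weights))
    p_on = 1.0 / (1.0 + math.exp(-activation))
    return 1 if random.random() < p_on else 0
```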

Page 58

The phenomenon of explaining away

Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence.

If we learn that there was an earthquake, this reduces the probability that the house jumped because of a truck.

Posterior:
p(1,1) = 0.0001
p(1,0) = 0.4999
p(0,1) = 0.4999
p(0,0) = 0.0001

Page 59

Complementary Priors

• The phenomenon of explaining away makes inference difficult in directed belief nets.

• Explaining away in the first hidden layer is eliminated by using extra hidden layers to create a "complementary" prior that has exactly the opposite correlations to those in the likelihood term.

• Then, when the likelihood term is multiplied by the prior, we get a posterior that is exactly factorial.

• The use of tied weights to construct complementary priors is like a trick for making directed models equivalent to undirected ones.

Page 60
Page 61

An Infinite Directed Model with Tied Weights

• We can generate data from an infinite directed net like any other directed acyclic belief net.

• We can sample from the true posterior distribution over all of the hidden layers by starting with a data vector on the visible units and then using the transposed weight matrices to infer the factorial distributions over each hidden layer in turn.

• At each hidden layer, we sample from the factorial posterior before computing the factorial posterior for the layer above.

Page 62

An Infinite Directed Model with Tied Weights

• Computing the derivative for a generative weight w_ij^00, from unit j in layer H0 to unit i in layer V0: in a logistic belief net, the maximum likelihood learning rule for a single data vector, v^0, is

∂ log p(v^0) / ∂w_ij^00 = ⟨h_j^0 (v_i^0 − v̂_i^0)⟩

where ⟨·⟩ denotes an average over the sampled states and v̂_i^0 is the probability that unit i would be turned on if the visible vector was stochastically reconstructed from the sampled hidden states.

Page 63

Restricted Boltzmann Machines and Contrastive Divergence

• The infinite directed net (Fig. 3) is equivalent to a restricted Boltzmann machine (RBM).

• To generate data from an RBM, we start with a random state in one of the layers and then perform alternating Gibbs sampling.

• This is the same process as generating data from the infinite belief net with tied weights.

Page 64

Restricted Boltzmann Machines and Contrastive Divergence

• To perform maximum likelihood learning in an RBM, we can use the difference between two correlations. For each weight w_ij, between visible unit i and hidden unit j, we measure the correlation ⟨v_i^0 h_j^0⟩ when a data vector is clamped on the visible units and the hidden states are sampled from their conditional distribution.

• Then, using alternating Gibbs sampling, we run the Markov chain until it reaches its stationary distribution and measure the correlation ⟨v_i^∞ h_j^∞⟩. The gradient of the log probability of the training data is then:

∂ log p(v^0) / ∂w_ij = ⟨v_i^0 h_j^0⟩ − ⟨v_i^∞ h_j^∞⟩

• This learning rule is the same as the maximum likelihood learning rule for the infinite logistic belief net with tied weights, and each step of Gibbs sampling corresponds to computing the exact posterior distribution in a layer of the infinite logistic belief net.
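Below is a minimal CD-1 sketch for a binary RBM in numpy: the intractable ⟨v_i^∞ h_j^∞⟩ term is approximated by the statistics after a single step of alternating Gibbs sampling, which is the contrastive divergence shortcut. The hyperparameters, and the use of probabilities rather than samples for the reconstruction statistics, are common simplifications, not the paper's exact recipe.

```python
# Hedged sketch: one CD-1 parameter update for a binary RBM.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, a, b, v0, lr=0.1, rng=None):
    """W: (n_vis, n_hid); a, b: visible/hidden biases; v0: (batch, n_vis)."""
    rng = rng or np.random.default_rng()
    ph0 = sigmoid(v0 @ W + b)                         # p(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
    pv1 = sigmoid(h0 @ W.T + a)                       # one-step reconstruction
    ph1 = sigmoid(pv1 @ W + b)                        # hidden probs for reconstruction
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n          # ~ <v0 h0> - <v1 h1>
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b
```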

Page 65

Restricted Boltzmann Machines and Contrastive Divergence

Page 66

Restricted Boltzmann Machines and Contrastive Divergence

• Contrastive divergence learning in a restricted Boltzmann machine is efficient enough to be practical.

• Variations that use real-valued units and different sampling schemes are described in the literature.

• However, this efficiency comes at a high price: when applied naively, contrastive divergence learning fails for deep, multilayer networks with different weights at each layer.

Page 67

A Greedy learning algorithm for Transforming Representations

• The equivalence between RBMs and infinite directed nets with tied weights suggests an efficient learning algorithm for multilayer networks in which the weights are not tied.

Page 68

A Greedy learning algorithm for Transforming Representations

Page 69

A Greedy learning algorithm for Transforming Representations

• The top two layers interact via undirected connections, and all of the other connections are directed. There are no intra-layer connections.

• It is possible to learn sensible (though not optimal) values for the parameters W0 by assuming that the parameters between the higher layers will be used to construct a complementary prior for W0.

• The task of learning W0 under this assumption reduces to the task of learning an RBM, and good approximate solutions can be found rapidly by minimizing contrastive divergence.

• Once W0 has been learned, the data can be mapped through W0^T to create higher-level "data" at the first hidden layer.

Page 70

A Greedy learning algorithm for Transforming Representations

• An RBM will not be able to model the data perfectly. We can make the generative model better using the greedy algorithm (sketched below):

1. Learn W0 assuming all the weight matrices are tied.

2. Freeze W0 and commit to using W0^T for inference, even if subsequent changes in the higher-level weights mean that this inference method is no longer correct.

3. Keeping all the higher weight matrices tied to each other, but untied from W0, learn an RBM model of the higher-level "data" that was produced by using W0^T to transform the original data.
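A compact sketch of this greedy procedure: train an RBM on the current data with CD-1, freeze its weights, map the data upward to form the higher-level "data", and repeat for the next layer. The layer sizes, epoch count, and learning rate are illustrative.

```python
# Hedged sketch of greedy layer-wise pretraining with inline CD-1 updates.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def greedy_pretrain(data, layer_sizes, epochs=10, lr=0.1, rng=None):
    rng = rng or np.random.default_rng()
    weights, x = [], data
    for n_hid in layer_sizes:
        W = 0.01 * rng.standard_normal((x.shape[1], n_hid))
        a, b = np.zeros(x.shape[1]), np.zeros(n_hid)
        for _ in range(epochs):                       # CD-1 on this layer's RBM
            ph0 = sigmoid(x @ W + b)
            h0 = (rng.random(ph0.shape) < ph0).astype(float)
            pv1 = sigmoid(h0 @ W.T + a)               # one-step reconstruction
            ph1 = sigmoid(pv1 @ W + b)
            n = x.shape[0]
            W += lr * (x.T @ ph0 - pv1.T @ ph1) / n
            a += lr * (x - pv1).mean(axis=0)
            b += lr * (ph0 - ph1).mean(axis=0)
        weights.append((W, a, b))                     # freeze this layer, then
        x = sigmoid(x @ W + b)                        # map data up for the next RBM
    return weights
```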

Page 71

Performance on the MNIST Database

Page 72

Performance on the MNIST Database

• The database contains 60,000 training images and 10,000 test images.

• The performance of the network was 1.25% errors on the official test set.

• The only standard machine learning technique that comes close to the 1.25% error rate is a support vector machine, which gives an error rate of 1.4%.

Page 73

Performance on the MNIST Database

Page 74

Looking into the Mind of a Neural Network

• To generate samples from the model, perform Gibbs sampling in the top-level associative memory until the Markov chain converges to its equilibrium distribution.

• A sample from this distribution is then used as input to the layers below to generate an image.

Page 75

Looking into the Mind of a Neural Network

Page 76

Conclusions I

• It is possible to learn a deep, densely connected belief network one layer at a time.

• This can be done by assuming that the higher layers exist and have tied weights that are constrained to implement a complementary prior, which makes the true posterior exactly factorial.

• This is equivalent to having an undirected model that can be learned efficiently using contrastive divergence.

Page 77

Conclusions II

• After each layer has been learned, its weights are untied from the weights in higher layers.

• As the higher-level weights change, the priors for lower layers cease to be complementary, so the true posterior distributions in the lower layers are no longer factorial.

• Nevertheless, adapting the higher-level weights improves the overall generative model.

Page 78

Conclusions III

• It might be better to learn an ensemble of larger, deeper networks

• The implemented network has about as many parameters as 0.002 cubic millimeters of mouse cortex, and several hundred networks of this complexity could fit within a single voxel of a high-resolution fMRI scan.

• This suggests that much bigger networks may be required to compete with human shape recognition abilities.

Page 79

Conclusion IV: Advantages over discriminative models

• Generative models can learn low-level features without requiring feedback from the label, and they can learn many more parameters than discriminative models.

• It is easy to see what the network has learnt by generating from its model.

• It is possible to interpret the nonlinear, distributed representations in the deep hidden layers by generating images from them.

Page 80

"Thank you."