20131019 Biophysics Young Researchers (生物物理若手) journal club

Post on 14-Jun-2015


DESCRIPTION

Proteins. 2013 Nov;81(11):1885-99. doi: 10.1002/prot.24330. Epub 2013 Aug 16. DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. Liu R, Hu J.

TRANSCRIPT

DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. Proteins. 2013 Jun 5.

20131019 Biophysics Young Researchers, Kansai Branch (生物物理若手関西支部) Journal Club

Topics

Prediction of protein-DNA binding residues

Statistics of network

Machine learning

Result: DNABind, a hybrid method combining machine learning and template-based approaches, showed excellent performance in predicting DNA-binding residues.

Figure: CprK (3E6C:C) and EcoRV (1RVE:A) predicted by machine learning, template, and DNABind. Legend: query protein, template protein, TP, FP, FN.

DNABind improves classification.

True positive residues.

Aim

Protein-DNA interactions are important for cell biology.

Determining them by experiment is time-consuming and costly.

Computational approaches are desirable.

Computational approaches

Data bank (PDB)

Characteristics of binding residues: solvent-exposed; higher electrostatic potential; more conserved; hotspots as clusters of conserved residues

Structural properties (DNA-binding residues vs. surface): packing density; surface curvature; B-factor; residue fluctuation; hydrogen bond donor

http://www.rcsb.org/pdb/home/home.do

Feature-based: extract effective features

Template-based: align templates and retrieve the best match

Computational algorithms

Template!!


Features used in machine learning

Structure-based: PSSM (position-specific scoring matrix); evolutionary conservation; solvent accessibility; local geometry (depth and protrusion index); topological features (degree, closeness, betweenness, clustering coefficient); relative position (distance to centroid); statistical potential (Boltzmann distribution)

Sequence-based (more difficult than structure-based): amino acid identity; residue physicochemical properties (polarity, secondary structure, molecular volume, codon diversity, electrostatic charge); predicted structure (no 3D structure needed!!)
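As an illustration of one structure-based feature, the relative position (distance to centroid) can be sketched as below. This is a minimal sketch, not the authors' implementation: it assumes one representative 3D point per residue (for example a Cα atom, which is an assumption), and the function name `distances_to_centroid` is hypothetical.

```python
import math

def distances_to_centroid(coords):
    """Distance of each residue's representative point to the protein centroid.

    coords: list of (x, y, z) tuples, one per residue (assumed input format).
    """
    if not coords:
        raise ValueError("coords must not be empty")
    n = len(coords)
    # Centroid = arithmetic mean of the coordinates along each axis.
    centroid = tuple(sum(c[k] for c in coords) / n for k in range(3))
    return [math.dist(c, centroid) for c in coords]

# Toy coordinates (hypothetical, not taken from any PDB entry).
toy_coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
toy_distances = distances_to_centroid(toy_coords)
```

In the toy input the third point coincides with the centroid, so its distance is zero while the other two lie 2 units away.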

Features used in machine learning

Structure-based: PSSM; relative solvent accessibility; depth and protrusion index; topological features; distance to centroid; statistical potentials

Sequence-based: PSSM; predicted structures; amino acid indices; statistical potentials

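Construct machine learning (SVM)

The structure-based score $STR_{score}$ and the sequence-based score $SEQ_{score}$ are combined:

$$ML_{score} = \alpha\, STR_{score} + (1-\alpha)\, SEQ_{score}$$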

Template-based approach

Used in image recognition, etc., for example recognition of faces in a camera.

Figure labels: Template!!, Match!!

Template-based prediction

Template-based: structural alignment and statistical potential. Binding residues are predicted only if the target protein is considered a DNA-binding protein.

312 templates were selected.

Network

Degree is a commonly used measure to reflect the local connectivity of a node.

Closeness is a global centrality metric used to determine how critical a residue is in a residue interaction network.

Betweenness of residue i is defined to be the sum of the fraction of shortest paths between all pairs of residues that pass through residue i.

Clustering coefficient (transitivity) of a node quantifies how close the node's neighbors are to being a clique, i.e. the probability that two neighbors of the node are themselves connected.

Motif, hub, and community are also important…
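The four topological measures defined above can be computed on a small graph as follows. This is a sketch in plain Python under stated assumptions: the graph is undirected, unweighted, and connected, the five-node toy graph `adj` is hypothetical (it is not a real residue interaction network), and betweenness uses Brandes' algorithm without normalization.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree(adj, i):
    """Number of direct neighbors of node i."""
    return len(adj[i])

def closeness(adj, i):
    """(number of other nodes) / (sum of shortest-path distances from i)."""
    dist = bfs_distances(adj, i)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def clustering(adj, i):
    """Fraction of pairs of neighbors of i that are themselves linked."""
    nbrs = list(adj[i])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a in range(k) for b in range(a + 1, k)
                if nbrs[b] in adj[nbrs[a]])
    return 2.0 * links / (k * (k - 1))

def betweenness(adj):
    """Unnormalized betweenness of every node (Brandes' algorithm)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            stack.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate from the farthest nodes back
            w = stack.pop()
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph each pair is visited from both endpoints.
    return {v: b / 2.0 for v, b in bc.items()}

# Hypothetical toy graph: triangle A-B-C with a tail C-D-E.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
bc = betweenness(adj)
```

In this toy graph node C bridges the triangle and the tail, so it has the highest degree and betweenness, while node A has a clustering coefficient of 1 because its two neighbors are linked.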

Network sample; human protein interactome

Scale-free; small-world; cluster

Power law (Pareto distribution)

Bioinformatics. 2012 Jan 1;28(1):84-90.

Machine learning

Example: spam data, 4601 samples, 57 parameters. Classification: spam or nonspam

Machine learning methods: support vector machine (SVM); decision tree; RandomForest; logistic regression; LASSO (Elastic net and Ridge); neural networks (deep learning); evolutionary algorithm; Gaussian process; k-nearest neighbor; clustering; Bayesian networks; association rule learning; inductive logic programming (ILP)

Support vector machine (SVM)

Finds a hyperplane that separates the groups. Kernel method: maps a non-linear problem to a linear one. Easy to do. Takes much computational time. Tuning is very difficult.

Decision tree

Make many trees. Easy to understand graphically. Performance is not so good.

RandomForest

Make many decision trees. Much more precise. Somewhat time-consuming.

Logistic regression

Many medical researchers use it… Easy to use, but tuning is very difficult (to tell the truth…).

LASSO, Elastic net, and Ridge regression

LASSO: Least Absolute Shrinkage and Selection Operator

$$\alpha = \begin{cases} 1 & \text{LASSO} \\ 0 < \alpha < 1 & \text{Elastic Net} \\ 0 & \text{Ridge} \end{cases}$$
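To make the role of the mixing parameter concrete, the sketch below evaluates an elastic-net penalty for a coefficient vector. It assumes the common parameterization in which the penalty is lam * (alpha * L1 + (1 - alpha) / 2 * L2 squared), so that alpha = 1 gives the LASSO penalty and alpha = 0 gives the ridge penalty; whether the slide used exactly this form is an assumption, and the function name `elastic_net_penalty` and the parameter `lam` are hypothetical.

```python
def elastic_net_penalty(coefs, alpha, lam=1.0):
    """Elastic-net penalty: lam * (alpha * |b|_1 + (1 - alpha) / 2 * |b|_2^2).

    alpha = 1 -> LASSO (pure L1), alpha = 0 -> ridge (pure L2),
    0 < alpha < 1 -> elastic net (a mixture of both).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

# Hypothetical coefficient vector used only for illustration.
example_coefs = [3.0, -4.0]
lasso_penalty = elastic_net_penalty(example_coefs, alpha=1.0)
ridge_penalty = elastic_net_penalty(example_coefs, alpha=0.0)
mixed_penalty = elastic_net_penalty(example_coefs, alpha=0.5)
```

The L1 term is what drives some coefficients exactly to zero (variable selection), while the L2 term only shrinks them.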

Neural networks

Artificial model of the mammalian brain (perceptron). Multiple hidden layers.

Deep learning is a hot topic!! (hard to understand…)

http://opencv.jp/opencv-1.0.0/document/opencvref_ml_nn.html

n-fold cross validation

To evaluate how the results of a statistical analysis will generalize to an independent data set.

Figure: the data are divided into folds; in each round one fold is the test data and the remaining folds are the train data.

Leave-one-out CV
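The splitting scheme can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the talk: the function name `n_fold_splits` and the fixed shuffling seed are assumptions. Leave-one-out CV is the special case where the number of folds equals the number of samples.

```python
import random

def n_fold_splits(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices); every sample is tested exactly once."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # shuffle once, reproducibly
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# 5-fold CV on 10 samples, and leave-one-out CV on 4 samples.
five_fold = list(n_fold_splits(10, 5))
leave_one_out = list(n_fold_splits(4, 4))
```

A model would be fitted on each `train` list and evaluated on the matching `test` list, and the per-fold results averaged.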

Performance

            SVM    Tree   RandomForest  LASSO  Elastic net  Ridge  Logistic  nnet
Recall      0.917  0.872  0.927         0.894  0.892        0.852  0.893     0.930
Precision   0.948  0.914  0.954         0.932  0.926        0.926  0.930     0.935
F           0.932  0.893  0.940         0.913  0.911        0.887  0.911     0.932
MCC         0.890  0.826  0.902         0.858  0.856        0.821  0.856     0.888
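For reference, the four measures in this table can be computed from the counts of a confusion matrix as sketched below, reading the last row as the Matthews correlation coefficient (MCC). The function name `classification_metrics` and the example counts are hypothetical; the counts behind the reported values are not given in the slides.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Recall, precision, F-measure, and MCC from confusion-matrix counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2.0 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "precision": precision,
            "f": f_measure, "mcc": mcc}

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example_metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
```

The F values in the table are consistent with the harmonic mean used here; for the SVM column, 2 × 0.948 × 0.917 / (0.948 + 0.917) ≈ 0.932.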

Combine two approaches

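$$C_{score} = \begin{cases} \beta\, ML_{score} + (1-\beta)\, TL_{score} & \text{if } \ldots \\ ML_{score} & \ldots \end{cases}$$

$$ML_{score} = \alpha\, STR_{score} + (1-\alpha)\, SEQ_{score}$$

$\alpha$ and $\beta$ are determined by CV and ROC analysis.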
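The two weighted sums can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the switching condition of the combined score did not survive in the slide text, so the sketch assumes that the machine learning score is used alone whenever no template-based score is available (represented here by `None`). The function names and the example weights are hypothetical.

```python
def ml_score(str_score, seq_score, alpha):
    """Weighted sum of the structure-based and sequence-based scores."""
    return alpha * str_score + (1.0 - alpha) * seq_score

def combined_score(ml, tl, beta):
    """Combine the machine learning score (ml) with the template score (tl).

    ASSUMPTION: tl is None when no template-based prediction exists, and in
    that case the machine learning score is returned unchanged.
    """
    if tl is None:
        return ml
    return beta * ml + (1.0 - beta) * tl

# Hypothetical per-residue scores and weights, for illustration only.
residue_ml = ml_score(str_score=0.8, seq_score=0.4, alpha=0.75)
with_template = combined_score(residue_ml, tl=0.9, beta=0.5)
without_template = combined_score(residue_ml, tl=None, beta=0.5)
```

In practice the weights would be scanned over a grid and chosen by cross-validated ROC performance, as the slide states.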

Statistical features of structure

A: Binding residues are highly solvent accessible.

B, C: Binding residues have low depth and high protrusion.

D-G: Little difference in the network features.

H: Binding residues are closer to the centroid.

Performance


Proteins. 2004 Dec 1;57(4):702-10.Nucleic Acids Res. 2005 Apr 22;33(7):2302-9.

TM-score is a measure of similarity between two protein structures with different tertiary structures. A score below 0.2 indicates a random relation, and a score above 0.5 indicates highly related structures.

A higher TM-score is required for good prediction.
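The standard TM-score formula sums 1 / (1 + (d_i / d0)^2) over aligned residue pairs and divides by the target length, with d0 = 1.24 (L - 15)^(1/3) - 1.8. The sketch below evaluates that sum for one given superposition. It does not search for the optimal superposition, which the real TM-score requires; the function names are hypothetical, and the lower bound of 0.5 on d0 for short chains is an assumption added so the formula stays well defined.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 of the TM-score."""
    if target_length <= 15:
        return 0.5  # ASSUMPTION: floor for short chains where the formula breaks down
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)

def tm_score(distances, target_length):
    """TM-score of one fixed superposition.

    distances: distance (in angstroms) between each aligned residue pair.
    target_length: number of residues in the target protein.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

# A perfect match of all 100 residues, and a match where every pair is d0 apart.
perfect = tm_score([0.0] * 100, 100)
half = tm_score([tm_d0(100)] * 100, 100)
```

Because the sum is divided by the target length rather than the number of aligned pairs, unaligned residues lower the score.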

Performance

Comparison among ML, TL, and DNABind.

Comparison between DNABind and other software.

Result: DNABind, a hybrid method combining machine learning and template-based approaches, showed excellent performance in predicting DNA-binding residues.

Figure: CprK (3E6C:C) and EcoRV (1RVE:A) predicted by machine learning, template, and DNABind. Legend: query protein, template protein, TP, FP, FN.

DNABind improves classification.

True positive residues.
