![Page 1: Protein tertiary structure prediction with new machine learning …rkuang/paper/intern.pdf · 2005. 9. 30. · Protein tertiary structure prediction with new machine learning approaches](https://reader034.vdocuments.net/reader034/viewer/2022051901/5ff085940166747183792cac/html5/thumbnails/1.jpg)
Protein tertiary structure prediction with new machine learning approaches
Rui Kuang
Department of Computer Science, Columbia University
Supervisors: Jason Weston (NEC) and Christina Leslie (Columbia)
NEC summer internship talk, August 30th, 2005
Agenda
1. Introduction to protein structure
2. Protein backbone angle prediction with structured output learning
3. Protein domain detection based on protein structural classification
4. Discussion
Part 1: Protein structure
• Protein: a name derived from the Greek word proteios, meaning “of the first rank”, coined in 1838 by Jöns J. Berzelius
• Crucial in all biological processes
• Function depends on structure (structure can help us understand function)
• Determination of protein structures is time-consuming and expensive
How to describe protein structure
• Primary structure: amino acid sequence
• Secondary structure: local structure elements
• Tertiary structure: packing and arrangement of secondary structure, also called domain
• Quaternary structure: arrangement of several polypeptide chains
Describe protein tertiary structure by protein backbone angles
[Figure: the 3-D structure (too complicated to predict!) is simplified to a sequence of Phi-Psi angles: …, (Φ1,Ψ1), (Φ2,Ψ2), (Φ3,Ψ3), (Φ4,Ψ4), (Φ5,Ψ5), (Φ6,Ψ6), (Φ7,Ψ7), (Φ8,Ψ8), …]
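As a rough illustration of what a single Phi-Psi pair measures: each angle is a dihedral defined by four consecutive backbone atoms, Phi by C(i-1), N(i), Cα(i), C(i) and Psi by N(i), Cα(i), C(i), N(i+1). The sketch below is not taken from the talk; the function name `dihedral` and the toy coordinates are assumptions made for illustration only.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Toy coordinates (hypothetical, not real atom positions).
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))  # cis arrangement
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))  # quarter turn
```

Applying such a function along the chain yields the (Φi,Ψi) sequence that the rest of the talk treats as the prediction target.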
Discretization of Phi-Psi angles: conformational states
Oliver et al. (Journal of Molecular Biology, 1997)
Protein blocks
De Brevern et al. (Protein Science, 2002)
16 small prototypes (a-p) of local protein structures, each 5 residues long, clustered from Phi-Psi angles
[Figure: the Phi-Psi angle sequence …, (Φ1,Ψ1), (Φ2,Ψ2), …, (Φ8,Ψ8), … is mapped to the protein block sequence … g l p d b b m …]
Summary: representations of 3-D protein structure
• 3-D structure
• Phi-Psi angles: (Φ1,Ψ1), (Φ2,Ψ2), (Φ3,Ψ3), (Φ4,Ψ4), (Φ5,Ψ5), (Φ6,Ψ6), (Φ7,Ψ7), (Φ8,Ψ8), …
• Conformational states: AAAGBBBBBBGEBBBB…
• Protein blocks: ammmalpppmmlmlbb…
Protein domains
• A polypeptide chain or a part of a polypeptide chain that can fold independently into a stable tertiary structure.
Part 2: Prediction of protein backbone angle with structured output learning
Naïve window-based approach
Kuang, Leslie and Yang et al. (Bioinformatics, 2004)
Encode each position independently with sequence information within a length-k window.
[Figure: each position of the conformational state sequence A A A B B B B G G E B B B B B is paired with a window of sequence profile values and sent to an SVM, for example:
A: -3 -4 -4 -4 -3 -4 …
A: 0 -1 -1 3 -4 3 4 1 …
B: 0 -1 2 1 -3 4 0 -1 …
B: -2 -3 -4 -5 -2 4 …
B: 0 -3 -1 -2 -4 -1 …
…]
Predictions are independent, yet neighboring positions are not: “We are neighbors! We have dependency!”
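The length-k window encoding can be sketched as follows. This is a minimal illustration, assuming the profile is a list of per-position score rows and that positions beyond the sequence ends are zero-padded; the name `window_features` and the padding choice are assumptions, not details given on the slide.

```python
def window_features(profile, k, pad=0.0):
    """One feature vector per position: the profile rows inside a
    length-k window centered on that position (k is assumed odd)."""
    half = k // 2
    width = len(profile[0])
    padding = [[pad] * width for _ in range(half)]
    padded = padding + [list(row) for row in profile] + padding
    return [
        [value for row in padded[i:i + k] for value in row]
        for i in range(len(profile))
    ]

# Hypothetical 3-position profile with 2 scores per position.
toy_profile = [[1, 2], [3, 4], [5, 6]]
for features in window_features(toy_profile, k=3):
    print(features)
```

Each resulting vector would be classified on its own, which is exactly why the predictions ignore the dependency between neighboring labels.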
2-Stage window-based approach
• Take the predictions of the naïve window-based approach as input to a second set of SVMs.
• Ideally this smoothing step can correct some wrong predictions.
[Figure: sequence profiles → window-based mapping → SVM (naïve window-based approach) → prediction profiles → window-based mapping → SVM (a second stage of smoothing) → final predictions AABGBEGE…]

Prediction profiles:

| A | B | G | E |
|---|---|---|---|
| 0.1 | 0.3 | 0.4 | 0.2 |
| 0.8 | 0.1 | 0.1 | 0.0 |
| 0.1 | 0.5 | 0.4 | 0.0 |
| 0.3 | 0.5 | 0.1 | 0.1 |
| … | … | … | … |
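To make the second stage concrete, the sketch below takes the four prediction profile rows shown on the slide, reads off first-stage labels by picking the highest-scoring state in each row, and builds the windowed inputs a second classifier would see. The helper names and the zero padding are assumptions; the second set of SVMs itself is not reproduced here.

```python
STATES = "ABGE"

# The four rows of the prediction profile shown on the slide.
prediction_profile = [
    [0.1, 0.3, 0.4, 0.2],
    [0.8, 0.1, 0.1, 0.0],
    [0.1, 0.5, 0.4, 0.0],
    [0.3, 0.5, 0.1, 0.1],
]

def decode(profile):
    """First-stage labels: the highest-scoring state at each position."""
    return "".join(STATES[row.index(max(row))] for row in profile)

def second_stage_inputs(profile, k):
    """Concatenate the prediction rows in a length-k window (zero-padded)."""
    half = k // 2
    zero = [0.0] * len(profile[0])
    padded = [zero] * half + profile + [zero] * half
    return [sum(padded[i:i + k], []) for i in range(len(profile))]

print(decode(prediction_profile))
print(second_stage_inputs(prediction_profile, k=3)[1])
```

Because each second-stage input contains the neighbors' first-stage scores, the second classifier has a chance to smooth out isolated mistakes.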
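Topographic SVM
Mohr and Obermayer et al. (NIPS, 2004)
[Figure, training: sequence profiles + true label profiles → window-based mapping → SVM]
[Figure, testing: sequence profiles + prediction profiles (from a base approach for the first round) → window-based mapping → SVM → update predictions]

True label profiles (training):

| A | B | G | E |
|---|---|---|---|
| 1.0 | 0.0 | 0.0 | 0.0 |
| 1.0 | 0.0 | 0.0 | 0.0 |
| 0.0 | 1.0 | 0.0 | 0.0 |
| 0.0 | 1.0 | 0.0 | 0.0 |
| … | … | … | … |

Prediction profiles (testing):

| A | B | G | E |
|---|---|---|---|
| 0.1 | 0.3 | 0.4 | 0.2 |
| 0.8 | 0.1 | 0.1 | 0.0 |
| 0.1 | 0.5 | 0.4 | 0.0 |
| 0.3 | 0.5 | 0.1 | 0.1 |
| … | … | … | … |

• Train with profiles + true labels.
• Iteratively update the predictions in the testing phase.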
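The test-phase loop described above might look roughly like the sketch below. It is only an outline under stated assumptions: `toy_classify` is a hypothetical stand-in for the trained SVM (it simply averages the label scores in the window), all positions are updated synchronously, and the placeholder sequence profile is ignored by the stand-in.

```python
def window(rows, i, half):
    """Concatenate rows i-half .. i+half, zero-padding past the ends."""
    width = len(rows[0])
    out = []
    for j in range(i - half, i + half + 1):
        out.extend(rows[j] if 0 <= j < len(rows) else [0.0] * width)
    return out

def iterative_refine(seq_profile, pred_profile, classify, k=3, rounds=3):
    """Re-predict every position from sequence and current label windows."""
    half = k // 2
    for _ in range(rounds):
        pred_profile = [
            classify(window(seq_profile, i, half), window(pred_profile, i, half))
            for i in range(len(seq_profile))
        ]
    return pred_profile

def toy_classify(seq_window, label_window, n_states=2):
    """Stand-in for a trained SVM: averages label scores in the window."""
    totals = [sum(label_window[s::n_states]) for s in range(n_states)]
    z = sum(totals) or 1.0
    return [t / z for t in totals]

seq = [[0.0] for _ in range(5)]  # placeholder sequence profile
first_round = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
refined = iterative_refine(seq, first_round, toy_classify, k=3, rounds=1)
labels = [row.index(max(row)) for row in refined]
print(labels)
```

In this toy run the isolated label in the middle is outvoted by its neighbors, which is the kind of correction the iterative update is meant to allow.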
Struct-SVM
Tsochantaridis (ICML, 2004)
• Testing: a pre-image problem
• Training: build a joint feature mapping of input and label, and apply the large-margin principle to the difference between the feature mapping of the correct label and that of a wrong label.
• This is equivalent to the following optimization problem
[Equation: not recoverable from the slide text]
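The slide's own formula is not available in this text. As a point of reference, the soft-margin problem stated in the cited Struct-SVM paper has the form below; whether the slide showed this exact variant (for example with or without loss rescaling) is an assumption. Here Ψ(x, y) is the joint feature mapping, ξ_i are slack variables and C is the regularization constant.

```latex
% Testing (pre-image problem): pick the label with the highest joint score
f(x; w) = \arg\max_{y \in \mathcal{Y}} \langle w, \Psi(x, y) \rangle

% Training: large margin between the correct label y_i and every wrong label y
\min_{w,\, \xi} \ \frac{1}{2} \lVert w \rVert^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
\forall i,\ \forall y \in \mathcal{Y} \setminus \{y_i\}: \
\langle w, \Psi(x_i, y_i) - \Psi(x_i, y) \rangle \ge 1 - \xi_i,
\qquad \xi_i \ge 0
```

The number of constraints grows with the number of possible label sequences, which is consistent with the scaling concern raised in the discussion.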
Pre-image for Labeling Sequences
Altun et al. (ICML 2003)
• Hidden-Markov kernel
• The pre-image is equivalent to Viterbi decoding of an HMM built from support vectors
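Viterbi decoding itself can be sketched in a few lines. This is a generic version over additive scores; in the cited approach the emission and transition scores would be derived from the support vectors, whereas the numbers below are hypothetical toy values and the function signature is an assumption.

```python
def viterbi(emission, transition, states):
    """Highest-scoring state sequence.
    emission[t][s]: score of state s at position t.
    transition[(a, b)]: score of moving from state a to state b."""
    best = [{s: emission[0][s] for s in states}]
    back = []
    for t in range(1, len(emission)):
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + transition[(p, s)])
            scores[s] = best[-1][prev] + transition[(prev, s)] + emission[t][s]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))

states = "AB"
emission = [{"A": 1.0, "B": 0.0}, {"A": 0.0, "B": 0.5}, {"A": 1.0, "B": 0.0}]
sticky = {(a, b): (0.0 if a == b else -2.0) for a in states for b in states}
free = {(a, b): 0.0 for a in states for b in states}

print(viterbi(emission, sticky, states))  # transitions penalize switching
print(viterbi(emission, free, states))    # same as per-position argmax
```

With the switching penalty the middle position follows its neighbors, while with free transitions the result reduces to independent per-position predictions.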
Preliminary Results
• Prediction of conformational states: 697 sequences of 97,365 amino acids with sequence identity < 25%
• Prediction of protein blocks: 675 sequences of 146,978 amino acids with sequence identity < 30%

| Methods | Accuracy (Conformational States) | Accuracy (Protein Blocks) |
|---|---|---|
| State of the art | 75.0% | 40.3% |
| Naïve window-based approach | 75.0% | 57.7% |
| 2-Stage window-based approach | >76.0% | >59.5% |
| Topographic SVM | 75.3% | 58.4% |
| SVM for structured output | >70.0% | >50% |
Part 3: Protein domain detection based on protein structural classification
Protein structural classification
Murzin et al. (Journal of Molecular Biology, 1995)
[Figure: the SCOP hierarchy (Fold, Superfamily, Family), marking the positive training set, positive test set, negative training set and negative test set]
• Family: sequence identity > 30%, or functions and structures are very similar
• Superfamily: low sequence similarity, but functional features suggest a probable common evolutionary origin
• Common fold: same major secondary structures in the same arrangement with the same topological connections
Spectrum kernel
Leslie et al. (PSB, 2002)
• Feature map indexed by all possible k-length subsequences (“k-mers”) from the alphabet Σ of amino acids, |Σ| = 20
Q1: AKQDYYYYE → AKQ, KQD, QDY, DYY, YYY, YYY, YYE
Q2: DYYEIAKQY → DYY, YYE, YEI, EIA, IAK, AKQ, KQY

Feature space (AAA-YYY):

| k-mer | Q1 | Q2 |
|---|---|---|
| AKQ | 1 | 1 |
| DYY | 1 | 1 |
| EIA | 0 | 1 |
| IAK | 0 | 1 |
| KQD | 1 | 0 |
| KQY | 0 | 1 |
| QDY | 1 | 0 |
| YEI | 0 | 1 |
| YYE | 1 | 1 |
| YYY | 2 | 0 |

K(Q1,Q2) = ⟨(…1…1…0…0…1…0…1…0…1…2), (…1…1…1…1…0…1…0…1…1…0)⟩ = 3
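The worked example above can be checked with a short sketch. It counts k-mers in each sequence and takes the inner product of the two count vectors; the function names are illustrative, and the explicit counting is a simple approach rather than the efficient data structures a real implementation might use.

```python
from collections import Counter

def spectrum(seq, k):
    """Count every contiguous k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(a, b, k):
    """Inner product of the two k-mer count vectors."""
    counts_a, counts_b = spectrum(a, k), spectrum(b, k)
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())

q1, q2 = "AKQDYYYYE", "DYYEIAKQY"
print(spectrum(q1, 3))
print(spectrum_kernel(q1, q2, 3))
```

Only k-mers present in both sequences (here AKQ, DYY and YYE) contribute to the kernel value.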
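Profile kernel
Kuang and Leslie et al. (JBCB, 2005)
• Use the profile to define position-dependent mutation neighborhoods:

$$P(x) = \{\, p_j(b),\ b \in \Sigma,\ j = 1 \ldots |x| \,\}$$

$$M_{(k,\sigma)}\big(P(x[j+1:j+k])\big) = \Big\{\, b_1 b_2 \ldots b_k : -\sum_{i=1}^{k} \log p_{j+i}(b_i) < \sigma \,\Big\}$$

• E.g. k=3, σ=5 and a profile of negative log probabilities for the 3-mer AKQ (columns are sequence positions, rows are candidate amino acids):

|   | A | K | Q | … |
|---|---|---|---|---|
| A | 1 | 3 | 4 | … |
| C | 5 | 4 | 1 | … |
| D | 4 | 4 | 4 | … |
| … | … | … | … | … |
| K | 4 | 1 | 4 | … |
| … | … | … | … | … |
| Q | 3 | 4 | 1 | … |
| … | … | … | … | … |
| Y | 2 | 4 | 3 | … |

Mutation neighborhood of AKQ: AKQ (1+1+1<σ), AKC (1+1+2<σ), YKQ (2+1+1<σ), YKC (2+1+1<σ)
Feature vector: ( 0, …, 1, …, 1, …, 1, …, 1, …, 0 ), with the nonzero entries at AKC, AKQ, YKC, YKQ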
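The neighborhood definition can be illustrated by brute-force enumeration. The sketch uses only the six profile rows that are listed on the slide (A, C, D, K, Q, Y); the elided rows are left out, so this is a restricted illustration and not the full 20-letter computation. It follows the table values as extracted, under which AKC sums to 1+1+1 rather than the 1+1+2 annotated on the slide; the names `NEG_LOG_P` and `mutation_neighborhood` are assumptions.

```python
from itertools import product

# Negative log probabilities per position of the 3-mer AKQ,
# for the rows shown on the slide only (other amino acids omitted).
NEG_LOG_P = {
    "A": [1, 3, 4],
    "C": [5, 4, 1],
    "D": [4, 4, 4],
    "K": [4, 1, 4],
    "Q": [3, 4, 1],
    "Y": [2, 4, 3],
}

def mutation_neighborhood(neg_log_p, k, sigma):
    """All k-mers whose summed negative log probability is below sigma."""
    letters = sorted(neg_log_p)
    return {
        "".join(kmer)
        for kmer in product(letters, repeat=k)
        if sum(neg_log_p[b][i] for i, b in enumerate(kmer)) < sigma
    }

print(sorted(mutation_neighborhood(NEG_LOG_P, k=3, sigma=5)))
```

Each k-mer in the returned set would contribute to the corresponding coordinate of the feature vector for this window of the sequence.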
Positional classification scores
[Figure: two panels, PDB ID: 1f2e and PDB ID: 1hnf]
A simple probabilistic model to detect domains:
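$$P(S,E \mid F) = P(s_0, s_1 - 1 \mid F) \cdot P(s_1, e_1 \mid F) \cdot P(e_1 + 1, s_2 - 1 \mid F) \cdot P(s_2, e_2 \mid F) \cdot P(e_2 + 1, s_3 - 1 \mid F) \cdot \ldots \cdot P(s_n, e_n \mid F) \cdot P(e_n + 1, |F| \mid F)$$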
Experiments
1. Dataset
• 7,329 sequences from SCOP 1.59.
• Sequence identity less than 95%.
2. Preliminary results (with a simplified model)

| Criteria | Accuracy | Coverage |
|---|---|---|
| Domain positions | 73.2% | 73.1% |
| Domain start | 51.1% | 36.0% |
| Domain end | 31.1% | 21.9% |
Part 4: Discussion
• Dependency between conformational states or protein blocks does not help much in the 2-stage window-based approach.
• Struct-SVM does not scale very well for large problems. Perceptron training may speed up the training stage.
• A proper probabilistic model is needed for detecting domain boundaries from positional classification scores.
Acknowledgement
• William Stafford Noble, Genome Science Department, University of Washington
• Asa Ben-Hur, Genome Science Department, University of Washington
• An-Suei Yang, Genome Research Center, Academia Sinica of Taiwan
• Yasemin Altun, Toyota Technological Institute at Chicago
• Thorsten Joachims, Computer Science Department, Cornell University