protein homology detection using string alignment kernels jean-phillippe vert, tatsuya akutsu
Post on 21-Dec-2015
![Page 1: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/1.jpg)
Protein Homology Detection Using String Alignment
Kernels
Jean-Philippe Vert, Tatsuya Akutsu
![Page 2: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/2.jpg)
Problem: classification of protein sequence data into families and superfamilies
Motivation: Many proteins have been sequenced, but often structure/function remains unknown
Motivation: infer structure/function from sequence-based classification
Learning Sequence Based Protein Classification
![Page 3: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/3.jpg)
>1A3N:A HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:B HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>1A3N:C HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:D HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
Sequences for four chains of human hemoglobin
Tertiary Structure
Function: oxygen transport
Sequence Data Versus Structure and function
![Page 4: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/4.jpg)
SCOP: Structural Classification of Proteins
Interested in superfamily-level homology – remote evolutionary relationship
Difficult !!
Structural Hierarchy
![Page 5: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/5.jpg)
Reduce to binary classification problem: positive (+) if example belongs to a family (e.g. G proteins) or superfamily (e.g. nucleoside triphosphate hydrolases), negative (-) otherwise
Focus on remote homology detection
Use a supervised learning approach to train a classifier
Labeled Training Sequences → Learning Algorithm → Classification Rule
Learning Problem
![Page 6: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/6.jpg)
Generative model approach
• Build a generative model for a single protein family; classify each candidate sequence based on its fit to the model
• Only uses positive training sequences
Discriminative approach
• Learning algorithm tries to learn a decision boundary between positive and negative examples
• Uses both positive and negative training sequences
Two supervised learning approaches to classification
![Page 7: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/7.jpg)
SCOP level and the methods that target it:
• Class: secondary structure prediction
• Fold: threading
• Superfamily: HMM, PSI-BLAST, SVM
• Family: SW, BLAST, FASTA
Targets of the current methods
![Page 8: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/8.jpg)
Discriminative approach
• Train on both positive and negative examples to learn a classifier
Modern computational learning theory
• Goal: learn a classifier that generalizes well to new examples
• Do not use training data to estimate the parameters of a probability distribution ("curse of dimensionality")
Discriminative Learning
![Page 9: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/9.jpg)
Want to define a feature map from the space of protein sequences to a vector space
Goals:
• Computational efficiency
• Competitive performance with known methods
• No reliance on a generative model: a general method for sequence-based classification problems
SVM for protein classification
![Page 10: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/10.jpg)
Feature vector from HMM
• Fisher kernel (Jaakkola et al., 2000)
• Marginalized kernel (Tsuda et al., 2002)
Feature vector from sequence
• Spectrum kernel (Leslie et al., 2002)
• Mismatch kernel (Leslie et al., 2003)
Feature vector from other scores
• SVM-pairwise (Liao & Noble, 2002)
Summary of the current kernel methods
![Page 11: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/11.jpg)
Observation: the Smith-Waterman (SW) alignment score provides a measure of similarity that incorporates biological knowledge about protein evolution.
It cannot be used as a kernel because it is not positive definite.
A family of local alignment (LA) kernels that mimic the SW score is presented.
String Alignment Kernels
![Page 12: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/12.jpg)
Other kernels: choose a feature vector representation, then get the kernel as the inner product of the vectors
LA kernel: measure similarity directly, then get a valid kernel
LA Kernels
![Page 13: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/13.jpg)
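Pair score kernel $K_a^{(\beta)}(x,y)$:

$$K_a^{(\beta)}(x,y) = \begin{cases} 0 & \text{if } |x| \neq 1 \text{ or } |y| \neq 1, \\ \exp(\beta\, s(x,y)) & \text{otherwise.} \end{cases}$$

Gap kernel $K_g^{(\beta)}(x,y)$ for the gap penalty model:

$$K_g^{(\beta)}(x,y) = \exp\bigl(-\beta\,(g(|x|) + g(|y|))\bigr), \quad \text{with } g(0) = 0,\; g(n) = d + e\,(n-1) \text{ for } n \geq 1.$$

Here d is the gap opening cost and e is the gap extension cost (both counted as penalties, hence the minus sign), $\beta \geq 0$, and s is a symmetric similarity score.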
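The two basic kernels can be written down directly. The following is a minimal sketch, not the authors' code: it assumes gap costs d and e are non-negative penalties that enter the exponent with a minus sign, and `toy_score` is a hypothetical identity-based score standing in for a real substitution matrix.

```python
import math


def gap_penalty(n, d, e):
    """Affine gap penalty: g(0) = 0, g(n) = d + e * (n - 1) for n >= 1."""
    return 0.0 if n == 0 else d + e * (n - 1)


def k_a(x, y, beta, score):
    """Pair score kernel: non-zero only when both strings are single residues."""
    if len(x) != 1 or len(y) != 1:
        return 0.0
    return math.exp(beta * score(x, y))


def k_g(x, y, beta, d, e):
    """Gap kernel: depends only on the lengths of the two strings."""
    return math.exp(-beta * (gap_penalty(len(x), d, e) + gap_penalty(len(y), d, e)))


def toy_score(a, b):
    """Hypothetical similarity score (assumption): +2 for a match, -1 otherwise."""
    return 2.0 if a == b else -1.0


if __name__ == "__main__":
    print(k_a("A", "A", 0.5, toy_score))   # single residues: exp(0.5 * 2)
    print(k_a("AB", "A", 0.5, toy_score))  # not single residues: 0.0
    print(k_g("AA", "", 1.0, 2.0, 1.0))    # gap of length 2 in x: exp(-(2 + 1))
```

Two empty strings give a gap kernel value of 1, so adjacent aligned residues with no gap between them incur no penalty.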
LA Kernels
![Page 14: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/14.jpg)
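Kernel convolution:

$$(K_1 \star K_2)(x,y) = \sum_{x = x_1 x_2,\; y = y_1 y_2} K_1(x_1,y_1)\, K_2(x_2,y_2)$$

For $n \geq 1$, the string kernel can be expressed as

$$K_{(n)}^{(\beta)}(x,y) = \Bigl(K_0 \star \bigl(K_a^{(\beta)} \star K_g^{(\beta)}\bigr)^{(n-1)} \star K_a^{(\beta)} \star K_0\Bigr)(x,y), \quad \text{where } K_0 = 1.$$

$K_0$ is the initial part, followed by a succession of n aligned residues $K_a^{(\beta)}$ with n-1 possible gaps $K_g^{(\beta)}$ between them, and a terminal part $K_0$.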
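The convolution can be evaluated literally for very short strings by summing over every way of splitting x and y in two. The sketch below is an illustration only (its cost grows quickly with string length and n): `convolve` and `make_k_n` are hypothetical names, gap costs are assumed to be non-negative penalties, and `toy_score` is a made-up match/mismatch score.

```python
import math


def convolve(k1, k2):
    """Convolution kernel: sum of k1(x1, y1) * k2(x2, y2) over all splits x = x1 x2, y = y1 y2."""
    def k(x, y):
        return sum(
            k1(x[:i], y[:j]) * k2(x[i:], y[j:])
            for i in range(len(x) + 1)
            for j in range(len(y) + 1)
        )
    return k


def make_k_n(n, beta, score, d, e):
    """Build K_(n) = K_0 * (K_a * K_g)^(n-1) * K_a * K_0 for n >= 1 (brute force)."""
    def k0(x, y):
        return 1.0

    def ka(x, y):
        if len(x) != 1 or len(y) != 1:
            return 0.0
        return math.exp(beta * score(x, y))

    def g(m):
        return 0.0 if m == 0 else d + e * (m - 1)

    def kg(x, y):
        return math.exp(-beta * (g(len(x)) + g(len(y))))

    k = k0
    for _ in range(n - 1):
        k = convolve(convolve(k, ka), kg)  # append one aligned pair and one gap
    return convolve(convolve(k, ka), k0)   # last aligned pair, then the terminal part


def toy_score(a, b):
    """Hypothetical similarity score (assumption): +2 for a match, -1 otherwise."""
    return 2.0 if a == b else -1.0


if __name__ == "__main__":
    k2 = make_k_n(2, 1.0, toy_score, 1.0, 1.0)
    print(k2("ACB", "AB"))  # sums the three alignments that use exactly 2 aligned pairs
```

Because convolution is associative, the grouping used in the loop does not change the result.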
LA Kernels
![Page 15: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/15.jpg)
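$$K_{LA}^{(\beta)}(x,y) = \sum_{i=0}^{\infty} K_{(i)}^{(\beta)}(x,y)$$

The sum converges for any x and y because it has only a finite number of non-zero terms: $K_{(i)}^{(\beta)}(x,y)$ vanishes once i exceeds min(|x|, |y|). It is a point-wise limit of Mercer kernels.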
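The finiteness of the sum can be made concrete by enumerating every local alignment of two short strings. This is a brute-force sketch for checking small cases, not a practical algorithm. It assumes the term for i = 0 equals K_0 = 1 (the empty alignment), that gap costs are non-negative penalties, and it uses a hypothetical `toy_score`.

```python
import math
from itertools import combinations


def la_kernel_bruteforce(x, y, beta, score, d, e):
    """Sum exp(beta * alignment score) over all local alignments of x and y."""
    def g(m):
        return 0.0 if m == 0 else d + e * (m - 1)

    total = 1.0  # i = 0 term, assumed to be K_0 = 1
    # No alignment can use more aligned pairs than the shorter string has residues.
    for n in range(1, min(len(x), len(y)) + 1):
        for xi in combinations(range(len(x)), n):
            for yj in combinations(range(len(y)), n):
                p = sum(score(x[i], y[j]) for i, j in zip(xi, yj))
                p -= sum(
                    g(xi[k + 1] - xi[k] - 1) + g(yj[k + 1] - yj[k] - 1)
                    for k in range(n - 1)
                )
                total += math.exp(beta * p)
    return total


def toy_score(a, b):
    """Hypothetical similarity score (assumption): +2 for a match, -1 otherwise."""
    return 2.0 if a == b else -1.0


if __name__ == "__main__":
    print(la_kernel_bruteforce("ACB", "AB", 1.0, toy_score, 1.0, 1.0))
```

The outer loop stops at min(|x|, |y|), which is exactly why only finitely many terms contribute.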
[Figure: an example gapped local alignment, with each aligned residue pair labelled by a factor $K_a^{(\beta)}$ and each gap by a factor $K_g^{(\beta)}$.]
LA Kernels
![Page 16: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/16.jpg)
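π: a local alignment
p(x,y,π): score of the local alignment π of x and y
Π(x,y): set of all possible local alignments of x and y

$$SW(x,y) = \max_{\pi \in \Pi(x,y)} p(x,y,\pi) = \frac{1}{\beta} \ln \max_{\pi \in \Pi(x,y)} \exp(\beta\, p(x,y,\pi))$$

$$K_{LA}^{(\beta)}(x,y) = \sum_{\pi \in \Pi(x,y)} \exp(\beta\, p(x,y,\pi))$$

$$\lim_{\beta \to +\infty} \frac{1}{\beta} \ln K_{LA}^{(\beta)}(x,y) = SW(x,y)$$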
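The limit is the usual log-sum-exp behaviour: for N alignments, the quantity (1/β) ln Σ exp(β p) always lies between max p and max p + (ln N)/β, so it approaches the maximum as β grows. A small numerical sketch, using made-up alignment scores rather than values from the slides:

```python
import math


def soft_max_score(scores, beta):
    """(1 / beta) * ln(sum_i exp(beta * p_i)), computed stably by factoring out the max."""
    m = max(scores)
    return m + math.log(sum(math.exp(beta * (p - m)) for p in scores)) / beta


if __name__ == "__main__":
    alignment_scores = [3.0, 1.0, 0.5]  # hypothetical scores p(x, y, pi)
    for beta in (0.1, 1.0, 10.0, 100.0):
        # The value decreases towards max(alignment_scores) as beta increases.
        print(beta, soft_max_score(alignment_scores, beta))
```

Factoring out the maximum before exponentiating avoids overflow for large β.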
LA with SW score
![Page 17: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/17.jpg)
1. SW keeps only the best alignment instead of summing over all alignments of x and y.
2. Taking the logarithm can destroy positive definiteness.
Why SW cannot be a kernel
![Page 18: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/18.jpg)
Sequence x: HAWGEG
Sequence y: AGEHV
SW score: keeps only the best local alignment, π1
LA kernel: sums over all local alignments π1, π2, π3, π4, ...
π1: AWGE / A-GE, p(x,y,π1) = 0.003
π2: AWGE / AG-E, p(x,y,π2) = 0.001
π3: HAWGE / A-G-E, p(x,y,π3) = 0.0006
π4: HAWGE-G / A-GEHV-, p(x,y,π4) = 0.0001
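The contrast in this example comes down to a maximum versus a sum. As a small sketch, taking the four values listed on the slide as the per-alignment contributions (an interpretation of the figure, not something the slide states explicitly):

```python
import math

# Values listed on the slide for the four example alignments.
contributions = [0.003, 0.001, 0.0006, 0.0001]

best_only = max(contributions)        # SW-style: only the best alignment counts
all_alignments = sum(contributions)   # LA-style: every listed alignment contributes

if __name__ == "__main__":
    print(best_only, all_alignments)
    print(math.isclose(all_alignments, 0.0047))
```

The suboptimal alignments add more than half again to the value of the best one, information that the SW score discards.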
Example
![Page 19: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/19.jpg)
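SVM-pairwise: x and y are each represented by a vector of SW scores, (0.9, 0.05, 0.3, 0.2) and (0.2, 0.3, 0.1, 0.01); the kernel value is their inner product, 0.227
LA kernel: x and y are compared directly through a pair HMM, giving 0.253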
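The SVM-pairwise value on this slide can be checked by hand: the inner product of the two score vectors is 0.18 + 0.015 + 0.03 + 0.002 = 0.227. The sketch below reproduces that and shows how such a vector could be built from any pairwise similarity. `pairwise_features` and `toy_similarity` are hypothetical helpers; a real setup would use SW scores against the training sequences.

```python
def pairwise_features(seq, train_seqs, similarity):
    """Represent seq by its similarity to every training sequence."""
    return [similarity(seq, t) for t in train_seqs]


def inner_product(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def toy_similarity(a, b):
    """Hypothetical stand-in for an SW score: number of positions that match."""
    return sum(1 for p, q in zip(a, b) if p == q)


if __name__ == "__main__":
    # The two score vectors shown on the slide.
    phi_1 = [0.9, 0.05, 0.3, 0.2]
    phi_2 = [0.2, 0.3, 0.1, 0.01]
    print(round(inner_product(phi_1, phi_2), 3))  # 0.227

    # Building feature vectors from a similarity function (toy data).
    train = ["AB", "AC"]
    print(pairwise_features("AB", train, toy_similarity))  # [2, 1]
```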
![Page 20: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/20.jpg)
K(x,x) is easily orders of magnitude larger than K(x,y), even for similar sequences x and y, which biases the performance of the SVM.
Diagonal Dominance Issue
![Page 21: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/21.jpg)
Diagonal Dominance Issue
(1) The eigen kernel LA-eig:
a. Subtract from the diagonal of the training Gram matrix its smallest negative eigenvalue, if there are negative eigenvalues.
b. LA-eig is therefore equal to the kernel before the shift, except possibly on the diagonal.
(2) The empirical kernel map LA-ekm
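Both fixes act on a training Gram matrix that may have negative eigenvalues. The sketch below is illustrative only. It assumes the smallest eigenvalue is supplied by a separate linear algebra routine, and it takes the empirical kernel map in its usual form, where each sequence is represented by its row of the Gram matrix and the new kernel is the inner product of two rows. Function names are hypothetical.

```python
def shift_diagonal(gram, smallest_eigenvalue):
    """LA-eig style fix: subtract the smallest eigenvalue from the diagonal if it is negative."""
    if smallest_eigenvalue >= 0:
        return [row[:] for row in gram]  # already positive semidefinite, nothing to do
    return [
        [v - smallest_eigenvalue if i == j else v for j, v in enumerate(row)]
        for i, row in enumerate(gram)
    ]


def empirical_kernel_map_gram(gram):
    """LA-ekm style fix: new Gram entry (i, j) is the inner product of rows i and j."""
    n = len(gram)
    return [
        [sum(gram[i][k] * gram[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Toy symmetric matrix with eigenvalues 3 and -1, so it is not positive semidefinite.
    gram = [[1.0, 2.0], [2.0, 1.0]]
    print(shift_diagonal(gram, -1.0))       # [[2.0, 2.0], [2.0, 2.0]]
    print(empirical_kernel_map_gram(gram))  # [[5.0, 4.0], [4.0, 5.0]]
```

The shift changes only the diagonal entries, while the empirical kernel map changes every entry but is positive semidefinite by construction.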
![Page 22: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/22.jpg)
Implementation
• The LA kernel (and therefore its logarithm) can be computed with complexity O(|x| · |y|) using dynamic programming, by a slight modification of the SW algorithm.
Normalization
Dataset
• 4352 sequences extracted from the Astral database (www.cs.columbia.edu/compbio/svmpairwise), grouped into families and superfamilies.
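One way such a dynamic program could look is sketched below. It is an illustration under stated assumptions, not necessarily the authors' exact recursion. The table names M, X and Y are hypothetical, gap costs d and e are taken as non-negative penalties, and `toy_score` stands in for a real substitution matrix. M[i][j] accumulates all alignments that end with x[i-1] aligned to y[j-1], while X and Y carry the same mass forward through a gap in x, or through a gap in y possibly preceded by a gap in x.

```python
import math


def la_kernel(x, y, beta, score, d, e):
    """Sum of exp(beta * score) over all local alignments, in O(|x| * |y|) time."""
    n, m = len(x), len(y)
    gap_open = math.exp(-beta * d)
    gap_ext = math.exp(-beta * e)
    # Tables have a zero border at row 0 and column 0.
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    X = [[0.0] * (m + 1) for _ in range(n + 1)]
    Y = [[0.0] * (m + 1) for _ in range(n + 1)]
    total = 1.0  # the empty alignment
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Start a new alignment here (the 1.0) or extend one ending before (i, j).
            M[i][j] = math.exp(beta * score(x[i - 1], y[j - 1])) * (
                1.0 + M[i - 1][j - 1] + X[i - 1][j - 1] + Y[i - 1][j - 1]
            )
            # Open or extend a gap that skips residues of x.
            X[i][j] = gap_open * M[i - 1][j] + gap_ext * X[i - 1][j]
            # Open or extend a gap that skips residues of y.
            Y[i][j] = gap_open * (M[i][j - 1] + X[i][j - 1]) + gap_ext * Y[i][j - 1]
            total += M[i][j]
    return total


def toy_score(a, b):
    """Hypothetical similarity score (assumption): +2 for a match, -1 otherwise."""
    return 2.0 if a == b else -1.0


if __name__ == "__main__":
    k = la_kernel("ACB", "AB", 1.0, toy_score, 1.0, 1.0)
    print(k, math.log(k) / 1.0)  # kernel value and its log-transformed version
```

Compared with SW, each maximum is replaced by a sum and scores are exponentiated, which is what makes the result a sum over all alignments rather than the score of the best one. For long sequences the values grow very large, so a practical version would likely work in log space.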
Methods
![Page 23: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/23.jpg)
ROC Curve
![Page 24: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/24.jpg)
ROC Curve
![Page 25: Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu](https://reader035.vdocuments.net/reader035/viewer/2022062516/56649d5a5503460f94a3a592/html5/thumbnails/25.jpg)
Summary for the kernels