1
Truncation of Protein Sequences for Fast Profile Alignment with Application
to Subcellular Localization
Man-Wai MAK and Wei WANG, The Hong Kong Polytechnic University
Sun-Yuan KUNG, Princeton University
2
Contents
1. Introduction
– Cell Organelles and Proteins Subcellular Localization
– Signal-Based vs. Homology-Based Methods
2. Speeding Up the Prediction Process
– Predicting Cleavage Site Location
– Truncating Profiles vs. Truncating Sequences
– Perturbational Discriminant Analysis
3. Experiments and Results
4. Conclusions
3
Organelles
• Cells have a set of organelles that are specialized for carrying out one or more vital functions.
• Proteins must be transported to the correct organelles of a cell to perform their functions properly.
• Therefore, knowing the subcellular localization of a protein is one step towards understanding its function.
4
Proteins and Their Subcellular Location
5
Subcellular Localization Prediction
Two key methods:
1. Signal-based
2. Homology-based
6
Signal-Based Method
Source: S. R. Goodman, Medical Cell Biology, Elsevier, 2008.
• The amino acid sequence of a protein contains information about its organelle destination.
• Typically, the information can be found within a short segment of 20 to 100 amino acids preceding the cleavage site.
• Signal-based methods (e.g. TargetP) can determine the location of the cleavage site.
7
Homology-Based Method
[Figure: A full-length query sequence is aligned with each of the N full-length training sequences, S(1)=KNKA···, S(2)=KAKN···, ..., S(N)=KGLL···. The resulting N-dim alignment vector is fed to an SVM classifier, which outputs the subcellular location.]
Advantage:
• Can predict sequences that do not have cleavage sites.
Drawback:
• Given a query sequence, we need to align it with every training sequence in the training set, causing long computation time.
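As a minimal sketch of how an N-dim alignment vector might be formed, the code below aligns a query with every training sequence and collects the scores. The Smith-Waterman scoring values (match +2, mismatch -1, gap -2) and the function names are illustrative assumptions, and raw sequences are aligned here for brevity, whereas the slides align PSI-BLAST profiles.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def alignment_vector(query, training_seqs):
    """Align the query with every training sequence to get an N-dim vector."""
    return [local_alignment_score(query, s) for s in training_seqs]


# Toy stand-ins for S(1)..S(N); real training sequences are much longer.
training = ["KNKA", "KAKN", "KGLL"]
vector = alignment_vector("KNKA", training)
```

Each query costs N alignments, and each alignment grows with the product of the two sequence lengths, which is why long sequences make this step slow.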
8
Sequences Length Distribution
• Many sequences are fairly long, so aligning whole sequences takes a long computation time.
• The cTP, mTP and SP segments are shorter than 100 amino acids (AAs) and contain the most relevant information.
• Computation can be saved by aligning the signal segments only.
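[Figure: Length distribution of sequences (x-axis: sequence length), with schematics of Ext, Mit and Chl sequences showing the SP, mTP and cTP segments that precede the cleavage site. Numeric labels in the figure: 820, 35, 1050, 18, 760.]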
9
Proposed Method: Aligning the Segments that Contain the Most Relevant Info.
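[Figure: An amino acid sequence (N-terminus to C-terminus) is passed to a signal-based cleavage site predictor (e.g. TargetP) and truncated at the predicted cleavage site. The truncated sequence is passed to the homology-based method, which outputs the subcellular location.]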
10
Aligning Profiles Vs. Aligning Sequences
Scheme I: Truncate the profiles
[Figure: Query sequence → PSI-BLAST → long profile → cut → short profile → pairwise alignment → score vector → SVM or KPDA → subcellular location.]
Scheme II: Truncate the sequences
[Figure: Query sequence → cut → short sequence → PSI-BLAST → short profile → pairwise alignment → score vector → SVM or KPDA → subcellular location.]
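The difference between the two schemes is only the order of the cut and the profile-building step, which the following sketch illustrates. `toy_profile` is a hypothetical stand-in for PSI-BLAST (it emits one one-hot row per residue), and the cut position is an assumed value.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_profile(seq):
    """Hypothetical stand-in for PSI-BLAST: one one-hot row per residue."""
    return [[1 if aa == res else 0 for aa in AMINO_ACIDS] for res in seq]


def scheme_one(seq, cut):
    """Scheme I: build the full-length profile, then keep the first `cut` rows."""
    return toy_profile(seq)[:cut]


def scheme_two(seq, cut):
    """Scheme II: keep the first `cut` residues, then build the profile."""
    return toy_profile(seq[:cut])


seq = "MKTLLVAGAKNKA"   # toy sequence
cut = 5                 # assumed cleavage site position
short_profile_one = scheme_one(seq, cut)
short_profile_two = scheme_two(seq, cut)
```

With this toy stand-in the two schemes give identical short profiles. With a real profile builder the rows may differ, because the database search sees different inputs, but Scheme II only ever runs the expensive step on the short sequence.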
11
Perturbational Discriminant Analysis
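Input and Hilbert Spaces:
[Figure: mapping from the input space to the Hilbert space.]
Empirical Space:
$\mathbf{k}(\mathbf{x}) = [K(\mathbf{x}_1, \mathbf{x}), \ldots, K(\mathbf{x}_N, \mathbf{x})]^{\mathsf{T}}$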
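To make the empirical space concrete, the sketch below maps an input x to the vector of its kernel values against the N training points. The RBF kernel, its width `sigma` and the toy training points are assumptions chosen for illustration; for sequences, the kernel values would instead come from profile alignment scores.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); sigma is assumed."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def empirical_vector(x, training_points, kernel=rbf_kernel):
    """Map x to [K(x_1, x), ..., K(x_N, x)], its empirical-space representation."""
    return [kernel(x_i, x) for x_i in training_points]


training_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]  # toy 2-dim data
k_x = empirical_vector((0.0, 0.0), training_points)
```

The dimension of the resulting vector equals the number of training points N, regardless of the dimension of the input space.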
12
Perturbational Discriminant Analysis
• The objective of PDA is to find an optimal discriminant function in the Hilbert space or empirical space:
• The optimal solution (see derivation in paper) in the empirical space is
• ρ represents the noise (uncertainty) level in the measurement. It also ensures numerical stability of the matrix inverse.
• ρ = 1 in this work.
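The role of ρ as a stabilizer of the matrix inverse can be illustrated with a regularized least-squares discriminant in the empirical space. This is a sketch under an assumed form, a = (K + ρI)^{-1} y with no bias term, and is not the solution derived in the paper; the kernel matrix, targets and helper names are made up for the example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_regularized_discriminant(K, y, rho=1.0):
    """Return a = (K + rho*I)^{-1} y (assumed form, no bias term)."""
    n = len(y)
    K_rho = [[K[i][j] + (rho if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return solve(K_rho, y)


def score(a, k_x):
    """Discriminant score f(x) = a^T k(x) in the empirical space."""
    return sum(ai * ki for ai, ki in zip(a, k_x))


# Toy kernel matrix for three training points and their +1/-1 targets.
K = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
y = [1.0, 1.0, -1.0]
a = fit_regularized_discriminant(K, y, rho=1.0)
```

Adding ρ to the diagonal keeps K + ρI well conditioned even when K itself is close to singular, for example when two training sequences are nearly identical.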
13
Perturbational Discriminant Analysis
Example on 2-D Data
[Figure panels: 3 classes of 2-dim data in the input space; RBF kernel matrix K; projection onto the 2-dim PDA space; decision boundaries in the input space.]
14
Perturbational Discriminant Analysis
Application to Sequence Classification
[Figure, training: training sequences → PSI-BLAST → training profiles → pairwise alignment → K → compute PDA parameters.]
[Figure, testing: test sequence → PSI-BLAST → test profile → align with training profiles → compute PDA score.]
15
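Perturbational Discriminant Analysis
Application to Multi-Class Problems
1-vs-Rest PDA Classifier:
[Figure: the input x is fed to C discriminant functions $f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_C(\mathbf{x})$, whose outputs go to a MAXNET.]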
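A 1-vs-rest classifier with a MAXNET can be sketched as a winner-take-all choice over the C discriminant scores. The three linear score functions below are hypothetical stand-ins for trained PDA discriminants, and the class labels are borrowed from the slides purely for illustration.

```python
def maxnet(scores):
    """Return the index of the largest discriminant score (winner-take-all)."""
    return max(range(len(scores)), key=lambda c: scores[c])


def one_vs_rest_predict(x, discriminants, labels):
    """Evaluate every f_c(x) and return the label of the winning class."""
    scores = [f(x) for f in discriminants]
    return labels[maxnet(scores)]


# Hypothetical score functions standing in for trained f_1(x) .. f_C(x).
discriminants = [
    lambda x: x[0] - x[1],
    lambda x: x[1] - x[0],
    lambda x: -abs(x[0]) - abs(x[1]),
]
labels = ["Ext", "Mit", "Chl"]
predicted = one_vs_rest_predict((2.0, 0.5), discriminants, labels)
```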
16
Perturbational Discriminant Analysis
Application to Multi-Class Problems
Cascaded PDA-SVM Classifier:
[Figure: test sequence → project onto (C–1)-dim PDA space → 1-vs-rest SVM classifier → class label.]
17
Experiments
Materials:
• Eukaryotic sequences extracted from Swiss-Prot 57.5
• Ext, Mit, and Chl contain experimentally determined cleavage sites
• 25% sequence identity (based on BLASTclust)
Performance Evaluation:
• 5-fold cross validation
• Prediction accuracy and Matthews correlation coefficient (MCC)
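The two evaluation measures can be computed as in the sketch below. Treating one class as positive and all others as negative when computing the MCC is an assumption about how the per-class figures were obtained, and the label lists are toy data.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mcc(y_true, y_pred, positive):
    """Matthews correlation coefficient with `positive` vs. all other classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


y_true = ["Ext", "Ext", "Mit", "Mit", "Chl", "Chl"]   # toy labels
y_pred = ["Ext", "Mit", "Mit", "Mit", "Chl", "Ext"]
acc = accuracy(y_true, y_pred)
mcc_mit = mcc(y_true, y_pred, "Mit")
```

Unlike accuracy, the MCC accounts for true negatives and false positives, so it is less easily inflated when the classes are of very different sizes.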
18
Comparing Kernel Matrices
[Figure: kernel matrix obtained with Scheme I and kernel matrix obtained with Scheme II.]
19
Sensitivity Analysis
• The localization performance degrades when the cut-off position drifts away from the ground-truth cleavage site.
• mTP and cTP are more sensitive to errors in cleavage site prediction than Ext.
[Figure: Subcellular localization accuracy (%) versus cut-off position (p-16, p-8, p-2, p, p+2, p+16, p+32, p+64), where p is the ground-truth cleavage site, with curves for Cyt/Nuc, Mit, Chl, Ext and Overall. Pipeline: sequence → cut at p±x → subcellular localization (PairProSVM) → subcellular location.]
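The cut-off positions used in this sensitivity analysis can be generated as in the following sketch, which truncates a sequence at p plus each offset. The convention that p counts the residues before the cleavage site, the clamping to the sequence length, and the toy sequence are assumptions.

```python
# Offsets relative to the ground-truth cleavage site p, as listed on the slide.
OFFSETS = [-16, -8, -2, 0, 2, 16, 32, 64]


def cut_at_offsets(seq, p, offsets=OFFSETS):
    """Truncate seq at p + offset for each offset, clamped to [1, len(seq)]."""
    cuts = {}
    for off in offsets:
        pos = min(max(p + off, 1), len(seq))
        cuts[off] = seq[:pos]
    return cuts


seq = "M" + "A" * 99          # toy 100-residue sequence
truncated = cut_at_offsets(seq, p=30)
```

Each truncated variant would then be passed to the localization predictor, so that accuracy can be compared across offsets.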
20
Performance of Cleavage Site Prediction
• The conditional random field (CRF) predictor is better than TargetP(Plant) at predicting the cleavage sites of signal peptides (Ext) but worse than TargetP(Nonplant).
• CRF is slightly inferior to TargetP in predicting the cleavage sites of mitochondrial sequences, but it is significantly better than TargetP in predicting the cleavage sites of chloroplast sequences.
[Figure: cleavage site prediction performance by category for TargetP(Plant), TargetP(NonPlant) and CRF.]
21
Comparing Profile Creation Time
[Figure: Scheme I and Scheme II pipelines, compared by profile creation time.]
Findings: Profile creation time can be substantially reduced by truncating the protein sequences at the cleavage sites.
22
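Training and Classification Time
[Figure: 1-vs-rest PDA classifier (x → $f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_C(\mathbf{x})$ → MAXNET) and cascaded PDA-SVM classifier (project onto (C–1)-dim PDA space → 1-vs-rest SVM classifier).]
Findings: The training times of 1-vs-rest PDA and cascaded PDA-SVM are substantially shorter than that of SVM.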
23
Compare with State-of-the-Art Localization Predictors
[Figure: localization accuracy (%) and MCC for two pipelines. Proposed: query seq. → cleavage site prediction (TargetP/CRF, where CRF denotes conditional random fields) → subcellular localization (PairProSVM) → subcellular location. Existing predictors: query seq. → subcellular localization (SubLoc/TargetP) → subcellular location.]
Findings: In terms of localization accuracy, the proposed “Signal+Homology” method performs slightly better than the signal-based TargetP and is substantially better than the homology-based SubLoc.
24
Conclusion
• Fast subcellular localization prediction can be achieved by a cascaded fusion of signal-based and homology-based methods.
• As far as localization accuracy is concerned, it does not matter whether we truncate the sequences or the profiles. However, truncating the sequences reduces the profile creation time by a factor of six.
25
Compare with State-of-the-Art Localization Predictors
[Figure: two pipelines. Proposed: query seq. → cleavage site prediction (TargetP/CRF) → subcellular localization (PairProSVM) → subcellular location. Existing predictors: query seq. → subcellular localization (SubLoc/TargetP) → subcellular location.]
26
Performance of Cascaded Fusion
• The computation time for full-length profile alignment is 116 hours.
• Our method not only leads to a nearly 20-fold reduction in computation time but also boosts the prediction performance.
[Table: time and subcellular localization accuracy, Acc (%), for four kinds of input: full-length sequences; sequences with cleavage sites predicted by TargetP(P); sequences with cleavage sites predicted by TargetP(N); sequences with cleavage sites predicted by CRF.]
27
Fusion of Signal- and Homology-Based Methods
1) Cleavage site detection. The cleavage site (if any) of a query sequence is determined by a signal-based method.
2) Pre-sequence selection. The pre-sequence of the query is obtained by selecting the segment from the N-terminus up to the cleavage site.
3) Pairwise alignment. The pre-sequence is aligned with each of the training pre-sequences to form an N-dim vector, which is fed to a one-vs-rest SVM classifier for prediction.
[Figure: amino acid sequence (N-terminus to C-terminus) → signal-based cleavage site predictor → truncate at the cleavage site → pre-sequence → homology-based method → subcellular location.]
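The three steps of the fusion can be sketched end to end as follows. Every component here is a toy stand-in and should be read as an assumption: `predict_cleavage_site` replaces a real signal-based predictor such as TargetP, `match_score` replaces profile alignment, and picking the label of the best-scoring training pre-sequence replaces the one-vs-rest SVM.

```python
def predict_cleavage_site(seq):
    """Hypothetical stand-in for a signal-based predictor.

    Toy rule: cut right after the first 'A' at index 3 or later.
    Returns the number of residues before the cleavage site, or None.
    """
    for i in range(3, len(seq)):
        if seq[i] == "A":
            return i + 1
    return None


def select_presequence(seq, site):
    """Keep the segment from the N-terminus up to the cleavage site."""
    return seq if site is None else seq[:site]


def match_score(a, b):
    """Toy ungapped score: identical residues at the same position."""
    return sum(x == y for x, y in zip(a, b))


def predict_location(query, training_preseqs, training_labels):
    site = predict_cleavage_site(query)                        # step 1
    pre = select_presequence(query, site)                      # step 2
    vector = [match_score(pre, t) for t in training_preseqs]   # step 3
    # Stand-in for the SVM: label of the best-scoring training pre-sequence.
    best = max(range(len(vector)), key=lambda i: vector[i])
    return training_labels[best], vector


training_preseqs = ["MKTA", "MLSRA", "MASSA"]   # toy pre-sequences
training_labels = ["Ext", "Mit", "Chl"]
label, vector = predict_location("MLSRAVCGTSRQ", training_preseqs, training_labels)
```

Because only the pre-sequences are aligned, the cost of step 3 depends on the length of the signal segment and not on the full sequence length.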
28
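Perturbational Discriminant Analysis
Spectral Space:
Define the kernel matrix
K can be factorized via spectral decomposition into
[Figure: the empirical-space vector $[K(\mathbf{x}_1, \mathbf{x}), \ldots, K(\mathbf{x}_N, \mathbf{x})]^{\mathsf{T}}$ and the corresponding spectral-space vector $\mathbf{e}(\mathbf{x})$.]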