Truncation of Protein Sequences for Fast Profile Alignment with Application to Subcellular Localization

Man-Wai MAK and Wei WANG, The Hong Kong Polytechnic University

Sun-Yuan KUNG, Princeton University


Post on 28-Dec-2015




Contents

1. Introduction
– Cell Organelles and Protein Subcellular Localization
– Signal-Based vs. Homology-Based Methods

2. Speeding Up the Prediction Process
– Predicting Cleavage Site Location
– Truncating Profiles vs. Truncating Sequences
– Perturbational Discriminant Analysis

3. Experiments and Results

4. Conclusions


Organelles

• Cells have a set of organelles that are specialized for carrying out one or more vital functions.

• Proteins must be transported to the correct organelles of a cell to properly perform their functions.

• Therefore, knowing the subcellular localization is one step towards understanding the functions of proteins.


Proteins and Their Subcellular Location


Subcellular Localization Prediction

Two key methods:

1. Signal-based
2. Homology-based


Signal-Based Method

Source: S. R. Goodman, Medical Cell Biology, Elsevier, 2008.

• The amino acid sequence of a protein contains information about its organelle destination.

• Typically, the information can be found within a short segment of 20 to 100 amino acids preceding the cleavage site.

• Signal-based methods (e.g. TargetP) can determine the cleavage site location.

[Figure label: Cleavage site]


Homology-Based Method

[Figure: a full-length query sequence is aligned with each of the N full-length training sequences, S(1)=KNKA···, S(2)=KAKN···, ..., S(N)=KGLL···; the resulting N-dim alignment vector is fed to an SVM classifier, which outputs the subcellular location.]

Advantage:

• Can predict sequences that do not have cleavage sites.

Drawback:

• Given a query sequence, we need to align it with every training sequence in the training set, causing long computation time.
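The homology-based method turns a query into an N-dim vector by aligning it with each of the N training sequences. A minimal sketch of that step is given here, assuming a plain Smith-Waterman local alignment on raw sequences with made-up scoring parameters (`match`, `mismatch`, `gap`); the slides align profiles rather than raw sequences, and the SVM stage is left out.

```python
from typing import List


def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def alignment_vector(query: str, training: List[str]) -> List[int]:
    """Align the query with every training sequence: one score per sequence."""
    return [local_alignment_score(query, s) for s in training]


# Toy stand-ins for the training sequences S(1)..S(N); illustration only.
training_seqs = ["KNKA", "KAKN", "KGLL"]
vector = alignment_vector("KNKA", training_seqs)
print(vector)
```

Each query costs N alignments, which is the drawback the slide points out: the work grows with both the size of the training set and the length of the sequences.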


Sequence Length Distribution

• Many sequences are fairly long, so aligning the whole sequence takes a long computation time.

• cTP, mTP and SP are shorter than 100 amino acids (AAs) and contain the most relevant information.

• Computation can be saved by aligning the signal segments only.

[Figure: length distribution of sequences (x-axis: sequence length) for the Ext, Mit and Chl classes, with the SP, mTP and cTP segments and their cleavage sites marked.]
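A rough way to see why aligning only the signal segments saves computation: dynamic-programming pairwise alignment fills a table whose size is the product of the two sequence lengths. The sketch below counts table cells under that assumption; the lengths used are hypothetical and are not taken from the slides.

```python
def dp_cells(len_a: int, len_b: int) -> int:
    """Number of dynamic-programming cells filled when aligning two sequences."""
    return len_a * len_b


def speedup(full_len: int, segment_len: int) -> float:
    """Ratio of DP work: a full-length pair versus a truncated pair."""
    return dp_cells(full_len, full_len) / dp_cells(segment_len, segment_len)


# Hypothetical lengths chosen for illustration only (not from the slides).
full_length = 500
signal_length = 50
print(speedup(full_length, signal_length))
```

Because both sequences in a pair are shortened, the saving under this cost model grows with the square of the length ratio.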


Proposed Method: Aligning the Segments that Contain the Most Relevant Info.

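[Figure: a signal-based cleavage site predictor (e.g. TargetP) locates the cleavage site in the amino acid sequence (N-terminus to C-terminus); the sequence is truncated at the cleavage site, and the truncated sequence is passed to the homology-based method, which outputs the subcellular location.]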


Aligning Profiles Vs. Aligning Sequences

Scheme I: Truncate the profiles

[Figure: query sequence → PSI-BLAST → long profile → cut → short profile → pairwise alignment → score vector → SVM or KPDA → subcellular location.]

Scheme II: Truncate the sequences

[Figure: query sequence → cut → short sequence → PSI-BLAST → short profile → pairwise alignment → score vector → SVM or KPDA → subcellular location.]
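The difference between the two schemes is only the order of the cut and the profile-creation step. The sketch below makes that order explicit; `toy_profile` is a hypothetical stand-in for PSI-BLAST (a one-hot row per residue), and the sequence and cut position are made up.

```python
from typing import Callable, List

Profile = List[List[float]]          # one row of 20 scores per residue
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_profile(seq: str) -> Profile:
    """Hypothetical stand-in for PSI-BLAST: a one-hot row per residue."""
    return [[1.0 if aa == res else 0.0 for aa in AMINO_ACIDS] for res in seq]


def scheme_one(seq: str, cut: int, build: Callable[[str], Profile]) -> Profile:
    """Scheme I: build the long profile first, then cut the profile."""
    return build(seq)[:cut]


def scheme_two(seq: str, cut: int, build: Callable[[str], Profile]) -> Profile:
    """Scheme II: cut the sequence first, then build a short profile."""
    return build(seq[:cut])


sequence = "MKNKAKGLLAV"             # made-up sequence
short_one = scheme_one(sequence, 4, toy_profile)
short_two = scheme_two(sequence, 4, toy_profile)
print(len(short_one), len(short_two))
```

With this toy builder the two short profiles coincide because each row depends on one residue only. A real profile builder sees different input in the two schemes, so the profiles may differ; in Scheme II the builder only ever processes the short sequence.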


Perturbational Discriminant Analysis

Input and Hilbert Spaces:

[Figure labels: Input Space, Hilbert Space]

Empirical Space:

$\vec{k}(x) = [K(x_1, x), \ldots, K(x_N, x)]^T$


Perturbational Discriminant Analysis

• The objective of PDA is to find an optimal discriminant function in the Hilbert space or empirical space:

• The optimal solution (see derivation in paper) in the empirical space is

• ρ represents the noise (uncertainty) level in the measurement. It also ensures numerical stability of the matrix inverse.

• ρ = 1 in this work.
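To illustrate how ρ can enter an empirical-space solution, here is a minimal sketch that assumes the discriminant has the form f(x) = aᵀk(x) + b and is fitted as kernel ridge regression with a bias term, giving (K + ρI)a = y − b·e. This form, the function names and the toy data are assumptions for illustration; the paper's derivation should be consulted for the actual PDA solution.

```python
import numpy as np


def fit_regularised_discriminant(K: np.ndarray, y: np.ndarray, rho: float = 1.0):
    """Solve (K + rho*I) a = y - b*e, with b chosen so that e^T a = 0."""
    n = K.shape[0]
    e = np.ones(n)
    M = np.linalg.inv(K + rho * np.eye(n))   # rho > 0 keeps the inverse stable
    b = (e @ M @ y) / (e @ M @ e)
    a = M @ (y - b * e)
    return a, b


def score(a: np.ndarray, b: float, k_x: np.ndarray) -> float:
    """Discriminant value f(x) = a^T k(x) + b for an empirical-space vector k(x)."""
    return float(a @ k_x + b)


# Toy 1-D data with a linear kernel; labels are -1 / +1 (illustration only).
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = X @ X.T
a, b = fit_regularised_discriminant(K, y, rho=1.0)
print(score(a, b, K[:, 0]), score(a, b, K[:, 3]))
```

Here K alone is singular (rank 1), so the ρI term is what makes the matrix invertible, which matches the slide's remark about numerical stability.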


Perturbational Discriminant Analysis

Example on 2-D Data

[Figure panels: 3 classes of 2-dim data in the input space; RBF kernel matrix K; projection onto the 2-dim PDA space; decision boundaries in the input space.]


Perturbational Discriminant Analysis

Application to Sequence Classification

[Figure: training: training sequences → PSI-BLAST → training profiles → pairwise alignment → K → compute PDA parameters. Testing: test sequence → PSI-BLAST → test profile → align with training profiles → compute PDA score.]


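Perturbational Discriminant Analysis

Application to Multi-Class Problems

1-vs-Rest PDA Classifier:

[Figure: the input $x$ is fed to the $C$ discriminant functions $f_1(x), f_2(x), \ldots, f_C(x)$, whose outputs go to a MAXNET.]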
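In a 1-vs-rest classifier of this kind, each class has its own discriminant function and the MAXNET picks the class with the largest output. A minimal sketch follows; the three discriminant functions are hypothetical toys on a scalar input, not PDA discriminants.

```python
from typing import Callable, Sequence


def maxnet(scores: Sequence[float]) -> int:
    """Return the index of the largest score (ties go to the first maximum)."""
    return max(range(len(scores)), key=lambda c: scores[c])


def one_vs_rest_predict(x: float,
                        discriminants: Sequence[Callable[[float], float]]) -> int:
    """Evaluate every class discriminant on x and let MAXNET pick the class."""
    return maxnet([f(x) for f in discriminants])


# Three hypothetical discriminant functions on a scalar input.
fs = [lambda x: -abs(x + 1.0), lambda x: -abs(x), lambda x: -abs(x - 1.0)]
print(one_vs_rest_predict(0.9, fs))
```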


Perturbational Discriminant Analysis

Application to Multi-Class Problems

Cascaded PDA-SVM Classifier:

[Figure: test sequence → project onto (C–1)-dim PDA space → 1-vs-rest SVM classifier → class label.]


Experiments

Materials:

• Eukaryotic sequences extracted from Swiss-Prot 57.5
• Ext, Mit, and Chl contain experimentally determined cleavage sites
• 25% sequence identity (based on BLASTClust)

Performance Evaluation:

• 5-fold cross validation
• Prediction accuracy and Matthews correlation coefficient (MCC)
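The two evaluation ingredients can be sketched as follows: the standard binary MCC formula computed from confusion counts (for a multi-class task it would be applied per class, one class against the rest), and a simple way to form 5 folds. The interleaved fold assignment and the confusion counts are assumptions for illustration, not the authors' protocol or results.

```python
import math
from typing import List, Tuple


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def k_fold_indices(n: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n-1 into k (train, test) pairs with interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
            for i in range(k)]


splits = k_fold_indices(10, k=5)
print(splits[0][1])                       # test indices of the first fold
print(mcc(tp=40, tn=45, fp=5, fn=10))     # made-up confusion counts
```

MCC ranges from -1 to 1 and, unlike plain accuracy, stays informative when the classes are unbalanced.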


Comparing Kernel Matrices

[Figure: the kernel matrix obtained with Scheme I and the kernel matrix obtained with Scheme II, shown with the two scheme diagrams.]


Sensitivity Analysis

• The localization performance degrades when the cut-off position drifts away from the ground-truth cleavage site.

• mTP and cTP are more sensitive to the error of cleavage site prediction than Ext.

[Figure: subcellular localization accuracy (%) of PairProSVM versus cut-off position (p-16, p-8, p-2, p, p+2, p+16, p+32, p+64), where p is the ground-truth cleavage site and each sequence is cut at p±x; curves for Cyt/Nuc, Mit, Chl, Ext and Overall.]
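The sensitivity experiment cuts each sequence at positions displaced from the ground-truth cleavage site p. A minimal sketch of generating those truncated variants is shown below; it assumes p counts the residues kept from the N-terminus, and the clamping to the sequence bounds and the toy sequence are illustrative choices.

```python
from typing import Dict

OFFSETS = [-16, -8, -2, 0, 2, 16, 32, 64]   # offsets x in the cut-off positions p+x


def cut_variants(seq: str, p: int) -> Dict[int, str]:
    """Truncate seq at p + x for each offset, clamped to the valid range."""
    variants = {}
    for x in OFFSETS:
        cut = min(max(p + x, 1), len(seq))   # keep at least one residue
        variants[x] = seq[:cut]
    return variants


toy_seq = "M" + "A" * 99                     # made-up 100-residue sequence
variants = cut_variants(toy_seq, p=30)
print({x: len(s) for x, s in variants.items()})
```

Each variant would then go through the same localization pipeline, so that accuracy can be compared across cut-off positions.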


Performance of Cleavage Site Prediction

• The conditional random field (CRF) predicts the cleavage sites of signal peptides (Ext) better than TargetP(Plant) but worse than TargetP(Nonplant).

• CRF is slightly inferior to TargetP in predicting the cleavage sites of mitochondrial proteins, but it is significantly better than TargetP in predicting the cleavage sites of chloroplast proteins.

[Figure: cleavage site prediction performance by category for TargetP(Plant), TargetP(Nonplant) and CRF.]


Findings: Profile creation time can be substantially reduced by truncating the protein sequences at the cleavage sites.

Comparing Profile Creation Time

[Figure: the Scheme I and Scheme II diagrams.]


Findings: The training times of 1-vs-rest PDA and Cascaded PDA-SVM are substantially shorter than that of SVM.

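Training and Classification Time

[Figure: the 1-vs-rest PDA classifier (discriminant functions $f_1(x), f_2(x), \ldots, f_C(x)$ feeding a MAXNET) and the cascaded PDA-SVM classifier (projection onto the (C–1)-dim PDA space followed by a 1-vs-rest SVM classifier).]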


Findings: In terms of localization accuracy, the proposed “Signal+Homology” method performs slightly better than the signal-based TargetP and is substantially better than the homology-based SubLoc.

Compare with State-of-the-Art Localization Predictors

[Figure: localization accuracy (%) and MCC for two pipelines. Baseline: query seq. → subcellular localization (SubLoc/TargetP) → subcellular location. Proposed: query seq. → cleavage site prediction (TargetP or conditional random fields) → subcellular localization (PairProSVM) → subcellular location.]


Conclusion

• Fast subcellular localization prediction can be achieved by a cascaded fusion of signal-based and homology-based methods.

• As far as localization accuracy is concerned, it does not matter whether we truncate the sequences or truncate the profiles. However, truncating the sequences can reduce the profile creation time by a factor of 6.



Performance of Cascaded Fusion

• The computation time for full-length profile alignment is a striking 116 hours.

• Our method not only leads to a nearly 20-fold reduction in computation time but also boosts the prediction performance.

[Figure: computation time and subcellular localization accuracy, Acc (%), for full-length sequences and for sequences truncated at the cleavage site (Csite) predicted by TargetP(P), TargetP(N) and CRF.]


Fusion of Signal- and Homology-Based Methods

1) Cleavage site detection. The cleavage site (if any) of a query sequence is determined by a signal-based method.

2) Pre-sequence selection. The pre-sequence of the query is obtained by selecting the segment from the N-terminus up to the cleavage site.

3) Pairwise alignment. The pre-sequence is aligned with each of the training pre-sequences to form an N-dim vector, which is fed to a one-vs-rest SVM classifier for prediction.

[Figure: a signal-based cleavage site predictor locates the cleavage site in the amino acid sequence (N-terminus to C-terminus); the sequence is truncated there, and the resulting pre-sequence is passed to the homology-based method, which outputs the subcellular location.]
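The three steps of the fusion can be sketched end to end as below. Both `toy_site_predictor` (standing in for a signal-based predictor such as TargetP) and `shared_prefix_score` (standing in for profile alignment) are hypothetical toys, the sequences are made up, and the final SVM classification is omitted.

```python
from typing import Callable, List, Optional


def toy_site_predictor(seq: str) -> Optional[int]:
    """Hypothetical stand-in for step 1: cut after the first 'A', if any."""
    pos = seq.find("A")
    return None if pos < 0 else pos + 1


def pre_sequence(seq: str, cleavage_site: Optional[int]) -> str:
    """Step 2: keep the N-terminal segment up to the cleavage site, if one exists."""
    return seq if cleavage_site is None else seq[:cleavage_site]


def shared_prefix_score(a: str, b: str) -> float:
    """Hypothetical toy alignment score: length of the common prefix."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return float(n)


def score_vector(query: str, training: List[str],
                 align: Callable[[str, str], float]) -> List[float]:
    """Step 3: align the query pre-sequence with every training pre-sequence."""
    return [align(query, t) for t in training]


query = "MKLA" + "GGGG"
site = toy_site_predictor(query)
vec = score_vector(pre_sequence(query, site), ["MKLA", "MKQA", "TTTT"],
                   shared_prefix_score)
print(site, vec)
```

A sequence without a predicted cleavage site is passed through untruncated in this sketch, which is one possible reading of "(if any)" in step 1.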


Perturbational Discriminant Analysis

Spectral Space:

Define the kernel matrix

K can be factorized via spectral decomposition into

[Figure: mapping from the empirical space, with coordinates $K(x_1, x), \ldots, K(x_N, x)$, to the spectral space.]
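As a numerical illustration of factorizing a kernel matrix by spectral decomposition, the sketch below builds an RBF kernel matrix on toy 2-D points, decomposes it with `numpy.linalg.eigh`, and forms a matrix E with K = EᵀE. The symbols `w`, `V` and `E`, the kernel width and the data are assumptions for illustration and need not match the slide's notation.

```python
import numpy as np

# Toy data: kernel matrix K[i, j] = K(x_i, x_j) for an RBF kernel.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / 2.0)

# Spectral decomposition of the symmetric matrix K: K = V diag(w) V^T.
w, V = np.linalg.eigh(K)

# Column i of E represents training point x_i in the new coordinates, so K = E^T E.
E = np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

print(np.allclose(K, V @ np.diag(w) @ V.T), np.allclose(K, E.T @ E))
```

The clipping only guards against tiny negative eigenvalues from rounding; a valid kernel matrix is positive semi-definite.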