


Biological Sequence Classification with Multivariate String Kernels

Pavel P. Kuksa

Abstract—String kernel-based machine learning methods have yielded great success in practical tasks of structured/sequential data analysis. They often exhibit state-of-the-art performance on many practical tasks of sequence analysis such as biological sequence classification, remote homology detection, or protein superfamily and fold prediction. However, typical string kernel methods rely on the analysis of discrete 1D string data (e.g., DNA or amino acid sequences). In this paper, we address multiclass biological sequence classification problems using multivariate representations in the form of sequences of feature vectors (as in biological sequence profiles, or sequences of individual amino acid physicochemical descriptors) and a class of multivariate string kernels that exploit these representations. On three protein sequence classification tasks, the proposed multivariate representations and kernels show significant 15-20 percent improvements compared to existing state-of-the-art sequence classification methods.

Index Terms—Biological sequence classification, kernel methods


1 INTRODUCTION

ANALYSIS of large-scale sequential data has become an important task in machine learning and data mining, inspired in part by numerous scientific and technological applications such as biomedical literature analysis or the analysis of biological sequences. The classification of string data, sequences of discrete symbols, has attracted particular interest and has led to a number of new algorithms [1], [2], [3], [4], [5]. These algorithms often exhibit state-of-the-art performance on tasks such as protein superfamily and fold prediction [2], [4], [6], [7], or DNA sequence analysis [8].

A family of state-of-the-art kernel-based approaches to sequence modeling relies on fixed-length, substring spectral representations of sequences and the notion of mismatch kernels (see [2], [3]). There, a sequence is represented as the spectrum (histogram of counts) of all short substrings (k-mers) contained within the sequence. The similarity score K(X, Y) for a pair of sequences X and Y is established by exact or approximate matches of k-mers contained in X and Y. Initial work, for example, [3], [9], has demonstrated that this similarity can be computed using trie-based approaches in O(k^{m+1} |Σ|^m (|X| + |Y|)) time, for strings X and Y with symbols from alphabet Σ and up to m mismatches between k-mers. More recently, Kuksa et al. [10] introduced linear-time algorithms with alphabet-independent complexity applicable to the computation of a large class of existing string kernels.

However, typical spectral models (e.g., mismatch/spectrum kernels, gapped and wildcard kernels [6], [3]) essentially rely on symbolic Hamming-distance-based matching of 1D k-mers contained in the sequences. For example, given 1D sequences X and Y over alphabet Σ (e.g., amino acid sequences with |Σ| = 20), the spectrum-k kernel [11] and the mismatch-(k,m) kernel [3] measure similarity between sequences as

K(X, Y | k, m) = Σ_{γ ∈ Σ^k} Φ_{k,m}(γ | X) Φ_{k,m}(γ | Y)
              = Σ_{α ∈ X} Σ_{β ∈ Y} Σ_{γ ∈ Σ^k} I_m(α, γ) I_m(β, γ),   (1)

where

Φ_{k,m}(γ | X) = Σ_{α ∈ X} I_m(α, γ)   (2)

is the number of occurrences (possibly with up to m mismatches) of the k-mer γ in X, and I_m(α, γ) = 1 if α is in the mutational neighborhood N_{k,m}(γ) of γ, i.e., α and γ are at a Hamming distance of at most m. One interpretation of this kernel (1) is that of a cumulative Hamming-distance-based pairwise comparison of all k-long substrings α and β contained in sequences X and Y, respectively. In the case of mismatch kernels, the level of similarity of each pair of substrings (α, β) is based on the number of identical substrings their mutational neighborhoods N_{k,m}(α) and N_{k,m}(β) give rise to, Σ_{γ ∈ Σ^k} I_m(α, γ) I_m(β, γ). For the spectrum kernel, this similarity is simply the exact matching of α and β.
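To make the computation in (1)-(2) concrete, the following is a minimal, naive Python sketch of the mismatch kernel. It enumerates full mutational neighborhoods, so it is exponential in k and purely illustrative; it is not the trie-based algorithm of [3] or the linear-time algorithm of [10], and all function names are ours.

```python
from itertools import product

def mismatch_neighborhood(kmer, alphabet, m):
    """All k-mers within Hamming distance m of `kmer` (naive enumeration)."""
    hood = set()
    for cand in product(alphabet, repeat=len(kmer)):
        cand = "".join(cand)
        if sum(a != b for a, b in zip(kmer, cand)) <= m:
            hood.add(cand)
    return hood

def mismatch_spectrum(seq, k, m, alphabet):
    """Phi_{k,m}(gamma | X): counts of k-mers gamma, with up to m mismatches."""
    counts = {}
    for i in range(len(seq) - k + 1):
        for gamma in mismatch_neighborhood(seq[i:i + k], alphabet, m):
            counts[gamma] = counts.get(gamma, 0) + 1
    return counts

def mismatch_kernel(x, y, k=3, m=1, alphabet="ACGT"):
    """K(X, Y | k, m) as the inner product of the two spectra, as in (1)."""
    phi_x = mismatch_spectrum(x, k, m, alphabet)
    phi_y = mismatch_spectrum(y, k, m, alphabet)
    return sum(c * phi_y.get(g, 0) for g, c in phi_x.items())
```

With m = 0 this reduces to the exact-match spectrum-k kernel.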

We note that existing k-mer string kernels essentially use only 1D discrete sequences (e.g., amino acid or nucleotide sequences) and Hamming-based matching.

However, as shown in previous works, using other, multidimensional, protein sequence representations is crucial in obtaining more accurate and robust predictions. In part, low sequence identities among distantly related proteins with similar structures and functions motivated the use of these other multidimensional amino-acid physicochemical representations, as the physical and chemical properties of protein chains may be better preserved among these otherwise very dissimilar primary protein sequences. For instance, 20D sequence profiles, as in the profile kernel method [2], or amino-acid descriptor vectors, for example, as in recent works [12], [13], [14], can provide significantly more accurate results on a number of biological problems, including structural classification of proteins (SCOP), protein remote homology detection [13], [14], protein function prediction [15], [16], protein subcellular localization [17], protein-protein binding prediction [18], [19], and so on.

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 10, NO. 5, SEPTEMBER/OCTOBER 2013 1201

The author is with the Machine Learning Department, NEC Laboratories America, Inc., Princeton, NJ 08540. E-mail: [email protected].

Manuscript received 30 Aug. 2012; revised 18 Feb. 2013; accepted 25 Feb. 2013; published online 6 Mar. 2013. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBBSI-2012-08-0229. Digital Object Identifier no. 10.1109/TCBB.2013.15.

1545-5963/13/$31.00 © 2013 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.

In the cases of the multidimensional sequence representations mentioned above (sequence profiles or sequences of amino-acid descriptors), the input data occur in the form of sequences of R-dimensional (real-valued) feature vectors (i.e., multivariate sequences).

In this paper, we consider an approach that directly exploits these richer R-dimensional multivariate sequences (sequence profiles and amino-acid descriptor sequences, in particular) and propose general, simple discrete multivariate representations of sequences (Sections 3.1 and 3.2). The proposed class of multivariate similarity kernels allows efficient inexact matching and classification of the multivariate discrete sequences (i.e., sequences of R-dimensional discrete feature vectors) (see Section 3.3). The developed approach is applicable to both discrete- and continuous-valued original sequences, such as biological sequence profiles, or sequences of amino-acid descriptors. Experiments using the new multivariate string kernels on protein remote homology detection and fold prediction show excellent predictive performance (see Section 4) with significant (15-20 percent) improvements in predictive accuracy over existing state-of-the-art sequence classification methods.

2 RELATED WORK

Over the past decade, various methods have been proposed to solve the sequence classification problem, including generative approaches, such as HMMs, and discriminative approaches. Among the discriminative approaches, in many sequence analysis tasks, string kernel-based methods [20] provide some of the most accurate results [2], [3], [5], [6], [4], [21].

The key idea of basic string kernel methods is to map sequences of variable length into a fixed-dimensional vector space using a feature map Φ(·). In this space, a standard classifier such as a support vector machine (SVM) [20] can then be applied. As SVMs require only inner products between examples in the feature space rather than the feature vectors themselves, one can define a string kernel, which computes the inner product in the feature space without explicitly computing the feature vectors:

K(X, Y) = ⟨Φ(X), Φ(Y)⟩,   (3)

where X, Y ∈ D, and D is the set of all sequences composed of elements that take on a finite set of possible values from the alphabet Σ.

Sequence matching is frequently based on the co-occurrence of exact subpatterns (k-mers, features), as in spectrum kernels [11] or substring kernels [22]. Inexact comparison in this framework is typically achieved using different families of mismatch [3] or profile [2] kernels. Both spectrum-k and mismatch-(k,m) kernels directly extract string features from the observed sequence, X. On the other hand, the profile kernel, proposed by Kuang et al. [2], builds a 20 × |X| profile [23] P_X and then uses a similar |Σ|^k-dimensional representation, now derived from P_X.

Most existing string kernel methods essentially amount to the analysis of 1D sequences over finite alphabets Σ with 1D k-mers as basic sequence features (as in, e.g., spectrum/mismatch [11], [3], substring [22], gapped or wildcard kernels [6]). However, sequences can often be represented in the form of sequences of feature vectors, i.e., each input sequence X is a sequence of R-dimensional feature vectors, which can be considered an R × |X| feature matrix (i.e., a multivariate or 2D sequence). For example, protein sequences can be considered as sequences of R-dimensional (multivariate) feature vectors describing physical/chemical properties of individual amino acids (e.g., as in [12]), or as sequence profiles (e.g., as in [2] and [7]) describing each position as a probability distribution over amino-acid residues.

A number of methods have been proposed previously that use amino-acid descriptors (feature vectors), including earlier works on protein function prediction (e.g., SVM-Prot [16], ProtFun [24], DMP [15]), protein-peptide binding prediction [18], [19] using concatenations of 5D amino-acid descriptors as peptide representations, protein subcellular localization [17] using an amino-acid kernel derived from amino-acid substitution matrices, prediction of protein folding class [25] using peptide-chain descriptors for the composition and distribution of amino acid attributes, and protein remote homology detection using AAindex attributes [14], [13]. Recent work [12] uses BLOSUM descriptors to define a k-mer similarity measure and shows improvements over traditionally used string kernels.

In contrast to previous work that either uses amino-acid feature vectors to obtain fixed-length global sequence descriptors (as in, e.g., [19], [25], and [16]), or limits the use of amino-acid descriptors to defining a matching function for k-mers (e.g., [12], [17]), we propose novel methods that directly exploit these multivariate sequence representations (e.g., sequences of amino-acid descriptors, or sequence profiles) and allow capturing similarity/difference in both dimensions: along the protein chain and along the feature dimension (cumulative feature-chain similarity).

To this end, we propose discrete and binary multivariate protein sequence representations (see Sections 3.1 and 3.2) and consider a family of multivariate similarity kernels (see Sections 3 and 3.3) that, as we show empirically (see Section 4), provide effective improvements in practice over traditional 1D (univariate) sequence kernels for a number of challenging biological problems, including protein remote homology detection, protein structural class classification, and protein binding prediction.

3 MULTIVARIATE DISCRETE SEQUENCE METHOD

In a typical setting, string kernels are applied to 1D (univariate) string data, such as amino acid sequences or DNA sequences. In this paper, we consider alternative multivariate representations for sequences (see Fig. 1a) as sequences of R-dimensional feature vectors (e.g., sequences of amino-acid descriptor vectors, or amino-acid sequence profiles). In particular, given as input a description of biological sequences in the form of sequences of (real-valued) identically sized amino-acid feature vectors, we consider the following two discrete multivariate representations:

1. Symbolic embedding. Encoding the original real-valued R-dimensional feature vectors in a discrete (binary) E-dimensional space using, for example, a similarity hashing approach [26] (see Fig. 1a, left subfigure).

2. Direct feature quantization. Directly quantizing each feature using, for example, uniform binning (see Fig. 1a, right subfigure), i.e., representing the original (real-valued) R × |X| feature sequence as an R × |X| discrete sequence.

In both approaches, the (real-valued) multivariate (R × |X|) feature sequence X is re-represented as an E × |X| or R × |X| multivariate (2D) discrete feature sequence. Fig. 2 illustrates these representations for a given sequence X = 'SLFEQLGV'. In this figure, a 20D multivariate representation for the original 1D sequence of amino acids is obtained by replacing each individual amino acid with a 20D vector of BLOSUM62 substitution scores, which are then either directly discretized (see Fig. 2a) or transformed into binary vectors using symbolic Hamming embedding (see Fig. 2b). The resulting multivariate representations reflect underlying similarities among amino acids more accurately compared to symbolic 1D Hamming similarities.

We will show in the experiments that using these discrete multivariate representations can significantly (by 15-20 percent) improve predictive accuracy compared to traditional 1D (univariate) kernel representations as well as other state-of-the-art approaches (see Section 4).

Fig. 1. Proposed discrete multivariate representations.

Fig. 2. Examples of discrete multivariate representations for protein sequences using direct feature quantization or similarity hashing (binary Hamming embedding).

In the following, we discuss these proposed representation approaches in detail.

3.1 Direct Feature Quantization

In this approach, each feature f_j, j = 1 ... R, is quantized by dividing its range (f_j^min, f_j^max) into a finite number of intervals. In the simplest case, the intervals can be defined, for instance, using uniform quantization, where the entire feature data range is divided into B equal intervals of length Δ = (f_max − f_min)/B, and the index of the quantized feature value, Q(f) = (f − f_min)/Δ, is used to represent the feature value f. A partitioning of the feature data range could also be obtained by using 1D clustering, for example, k-means, to adaptively choose discretization levels.
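The uniform-binning case can be sketched as follows. This is an illustrative implementation under assumptions: the function names are ours, and the convention of reserving the codes 0 and B + 1 for out-of-range values (with in-range values mapped to 1 ... B) follows the out-of-range handling described in Section 4.4.

```python
def quantize_feature(f, f_min, f_max, B=32):
    """Uniform quantization: map feature value f to a bin index in 1..B.
    Values outside (f_min, f_max) get the special codes 0 and B + 1
    (an assumed convention, mirroring Section 4.4's out-of-range handling)."""
    if f < f_min:
        return 0
    if f > f_max:
        return B + 1
    delta = (f_max - f_min) / B           # bin width: (f_max - f_min) / B
    idx = int((f - f_min) / delta)        # Q(f) = (f - f_min) / delta
    return min(idx + 1, B)                # shift into 1..B; clamp f == f_max

def quantize_sequence(feature_rows, ranges, B=32):
    """Quantize an R x n real-valued feature sequence row by row,
    given one (f_min, f_max) range per feature row."""
    return [[quantize_feature(f, lo, hi, B) for f in row]
            for row, (lo, hi) in zip(feature_rows, ranges)]
```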

In the experiments (see Section 4), we use this approach with 20 × |X| sequence profiles obtained from PSI-BLAST and 20 × |X| BLOSUM substitution profiles, and show how using these multivariate representations with multivariate similarity kernels (see Section 3.3) can improve the predictive ability of classifiers on a number of protein sequence classification tasks.

3.2 Discrete (Symbolic) Embedding

Given a multivariate input sequence X = x_1, ..., x_n of R-dimensional feature vectors, each R-dimensional vector can be mapped into a discrete feature vector using a symbolic embedding E(·) as in, for example, similarity hashing [26]. Using similarity hashing, an input sequence X = x_1, ..., x_n of R-dimensional feature vectors, x_i ∈ R^R ∀i, is mapped into a binary Hamming-space embedded sequence

E(X) = E(x_1), ..., E(x_n),

where E(x_i) = e_i1 e_i2 ... e_iB is a symbolic Hamming embedding for item x_i in X, with |E(x_i)| = B, the number of bits in the resulting binary embedding of x_i. This embedding, as proposed in [26], essentially aims to minimize the average Hamming distance between binary embeddings corresponding to similar R-dimensional data points:

min Σ_{α,β} S(α, β) h(E(α), E(β))^2,   (4)

where S(α, β) is the similarity between data points α and β in the original R-dimensional space and h(E(α), E(β)) is the Hamming distance between the binary vectors E(α) and E(β) in the Hamming embedded space.

The solution (the set of binary embedding vectors E(·)) that minimizes the embedding objective in (4) can be obtained by solving an eigenvalue problem, as shown in [26], and thresholding the eigenfunctions to obtain binary codes.

Under this binary Hamming embedding, the Hamming similarity h(E(α), E(β)) between two B-dimensional feature embeddings E(α) and E(β) is proportional to the original similarity score S(α, β) between the R-dimensional vectors α and β, i.e.,

h(E(α), E(β)) ∝ S(α, β).   (5)

Using this Hamming embedding approach, the original R × |X| (real-valued) feature sequence X is represented as an E × |X| binary feature sequence, which can then be used with the string kernel method.
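The overall shape of such an embedding can be sketched with a much simpler stand-in: sign-of-random-projection (SimHash-style) codes, for which the Hamming distance between codes tracks the angular similarity of the original vectors. To be clear, this is not the learned spectral-hashing embedding of [26]; the random-hyperplane technique and the function name are this sketch's assumptions.

```python
import random

def binary_embedding(vectors, n_bits=8, seed=0):
    """Map R-dimensional real vectors to n_bits-bit binary codes using
    random hyperplanes (an illustrative substitute for the learned
    similarity-hashing embedding of [26])."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One random Gaussian normal vector (hyperplane) per output bit.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def encode(v):
        # Bit i is the side of hyperplane i the vector falls on.
        return tuple(1 if sum(p_i * v_i for p_i, v_i in zip(p, v)) > 0 else 0
                     for p in planes)

    return [encode(v) for v in vectors]

# A multivariate sequence X = x_1, ..., x_n then becomes the binary
# sequence E(X) = E(x_1), ..., E(x_n), one code per position.
```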

3.3 Multivariate Similarity Kernels

We now introduce efficient multivariate similarity kernels for the discrete multivariate sequence representations defined in Section 3.

The input sequences X and Y are sequences of identically sized (e.g., discrete R-dimensional or binary B-dimensional) amino-acid feature vectors (see Fig. 2). Similarity evaluation between the two sequences X and Y under multivariate (2D) representations amounts to comparing pairs of k × R (k × B) 2D submatrices contained in X and Y, where k is the length of the 2D k-mer (i.e., k is similar to the length of the k-mer in typical k-mer-based univariate kernels). A multivariate string kernel can then be defined for the multivariate (R-dimensional or B-dimensional) sequences X and Y as

K_2D(X, Y) = Σ_{α_2D ∈ X} Σ_{β_2D ∈ Y} K(α_2D, β_2D),   (6)

where α_2D and β_2D are R × k (or E × k) submatrices (2D k-mers) of X and Y, and K(α_2D, β_2D) is a kernel function defined for measuring the similarity between two submatrices. The multivariate kernel in (6) corresponds to a cumulative pairwise comparison of the submatrices (2D k-mers) contained in the multivariate sequences X and Y (this is similar to typical spectrum kernels, for example, the mismatch kernel in (1)).

One possible definition for the submatrix kernel K(·,·) in (6) is the row-based similarity function

K(α_2D, β_2D) = Σ_{i=1}^{R} I(α^i_2D, β^i_2D),   (7)

where I(·,·) is a similarity/indicator function for matching the 1D rows α^i_2D and β^i_2D. The matching function I(α, β) could be defined as I(α, β) = 1 if the Hamming distance d(α, β) ≤ m, and 0 otherwise (i.e., similar to the mismatch kernel). In the experiments, we use the spectrum, mismatch [3], and spatial sample (SSSK) [4] kernel matching functions as our 1D row matching function I(α^i_2D, β^i_2D) in (7), which results in the corresponding multivariate spectrum, mismatch, and spatial sample kernels (referred to as 2D-Spectrum, 2D-Mismatch, and 2D-SSSK, respectively).

Intuitively, according to the kernel definition (7), similar submatrices (2D k-mers), i.e., submatrices with many similar rows, will result in a high kernel value K(α_2D, β_2D).

Using (7), the multivariate (2D) kernel in (6) can be written as

K_2D(X, Y) = Σ_{i=1}^{R} Σ_{α_2D ∈ X} Σ_{β_2D ∈ Y} I(α^i_2D, β^i_2D),   (8)

which can be efficiently computed by running a spectrum kernel with the 1D k-mer matching function I(·,·) R times, i.e., once for each row i = 1, ..., R. The overall complexity of evaluating the multivariate kernel K_2D(X, Y) for two multivariate (R-dimensional) sequences X and Y is then O(R · k · n), i.e., linear in the sequence length n.
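The row-wise decomposition in (8) can be transcribed directly for the simplest case, the exact-match (spectrum) row kernel. This is a sketch under assumptions: the function names are ours, and a real implementation would substitute the mismatch or SSSK matching function for the 1D row kernel.

```python
from collections import Counter

def spectrum_counts(row, k):
    """Exact k-mer spectrum of one discrete feature row."""
    return Counter(tuple(row[i:i + k]) for i in range(len(row) - k + 1))

def spectrum_kernel_1d(row_x, row_y, k):
    """Exact-match spectrum kernel between two 1D rows."""
    cx, cy = spectrum_counts(row_x, k), spectrum_counts(row_y, k)
    return sum(c * cy.get(g, 0) for g, c in cx.items())

def kernel_2d(X, Y, k):
    """Eq. (8) with an exact-match row kernel: sum the 1D spectrum kernel
    over the R feature rows. X and Y are R x |X| (R x |Y|) discrete
    feature matrices given as lists of rows; the cost is R runs of a
    linear-time 1D kernel."""
    return sum(spectrum_kernel_1d(xr, yr, k) for xr, yr in zip(X, Y))
```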


4 EXPERIMENTAL EVALUATION

We study the performance of our methods in terms of predictive accuracy on a number of challenging biological problems using standard benchmark data sets for protein sequence analysis.

4.1 Data Sets and Experimental Setup

We test the proposed methods on a number of multiclass biological sequence classification and prediction tasks:

1. Protein remote homology detection task. This task tests the ability to build a classifier that correctly recognizes proteins from previously unseen protein families belonging to the target superfamily. We use two benchmark data sets (SCOP 1.53 and SCOP 1.59) that have been used in many previous works (e.g., [27], [2], [13], [28]). These benchmark data sets are derived from the SCOP database, which aims to categorize proteins into a structural hierarchy (see Fig. 3) of classes, folds, superfamilies, and families.

2. Protein fold recognition task. The task here is to correctly recognize proteins from previously unseen superfamilies under the target protein fold. We use a data set with 4,671 sequences derived from SCOP (SCOP 1.73) to simulate the fold recognition problem.

3. Multiclass protein fold classification. The task here is to classify a given protein into the correct fold. For this task, we use two benchmark data sets. The first is a benchmark data set (the Ding-Dubchak data set) containing 694 protein sequences from 27 protein folds [29], [7]. The second is a larger data set with 3,860 protein sequences from 26 different folds [7]. For both data sets, the protein sequences are divided into training and testing sets [29], [7].

4. MHC-peptide binding prediction. The goal of this prediction task is to predict whether a given peptide will bind to the target major histocompatibility complex (MHC) protein molecule. We use the benchmark data set proposed in [30] for this task.

We provide details of the tasks and benchmark data sets in Table 1. For each of the tasks, we now provide details of the data sets, experimental settings, and train/test procedures.

4.1.1 Protein Remote Homology Detection Data Set

For the remote protein homology prediction task, we follow the standard experimental setup used in previous studies [27] and evaluate average classification performance (ROC50) on 54 remote homology experiments, each simulating the remote homology detection problem by training a classifier on a subset of families (positive examples) under the target superfamily and testing the superfamily classifier on the remaining (held-out) families from the target superfamily according to the SCOP hierarchy (see Fig. 3). Negative training and testing examples are chosen from protein families outside of the target superfamily's fold [27], [11]. For example, as shown in the figure, the two families (1.a.1.1 and 1.a.1.2) from the target superfamily 1.a.1 will be used as positive training data, while the held-out sequences from the third family (1.a.1.3) will be used for testing the classifier's performance.

4.1.2 Protein Fold Recognition Data Set

Ding and Dubchak [29] designed a challenging fold recognition data set,1 which has been used as a benchmark in many studies, for example, [7]. The data set contains sequences from 27 folds divided into two independent sets, such that the training and test sequences share less than 35 percent sequence identity and, within the training set, no sequences share more than 40 percent sequence identity.

4.1.3 Remote Protein Fold Detection Data Set

Melvin et al. [7] derived this data set from SCOP 1.65 [31] for the task of multiclass remote fold detection. The data set contains 26 folds, 303 superfamilies, and 652 families for training, with 46 superfamilies held out for testing to model the remote fold recognition setting.

4.1.4 MHC-Peptide Binding Prediction Data Set

The task here is to predict which peptides bind to a given MHC protein molecule. Each type of MHC molecule (MHC allele) has unique binding preferences, and as a result the repertoire of binding peptides varies among MHC alleles. Predicting whether a peptide is a binder or a nonbinder for a given MHC allele is an important task in immunoinformatics and clinical research. We use the IEDB benchmark data set [30], which contains quantitative binding data (IC50 values) for short peptides (9-mers) with respect to various MHC alleles. Peptides with IC50 ≥ 500 are considered nonbinding, while peptides with IC50 < 500 are considered binding.
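The binder/nonbinder labeling rule used to binarize the quantitative data fits in one line (the function name is ours; the 500 threshold is the benchmark's):

```python
def binding_label(ic50):
    """Binarize the IEDB quantitative binding data at the benchmark's
    threshold: IC50 < 500 means binder, IC50 >= 500 means nonbinder."""
    return "binder" if ic50 < 500 else "nonbinder"
```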

4.2 Baseline Methods

We compare our approach with a number of other state-of-the-art methods for sequence classification, in both fully supervised and semi-supervised settings, including spectrum/mismatch [11], [3], spatial sample kernels [4], substitution [6], and semi-supervised profile kernels [2].

The spectrum/mismatch kernel for two sequences X and Y corresponds to a cumulative pairwise comparison of the k-mers contained in X and Y,

K(X, Y) = Σ_{α ∈ X} Σ_{β ∈ Y} S_m(α, β),

where the k-mer matching/similarity function

S_m(α, β) = Σ_{γ ∈ Σ^k} I_m(α, γ) I_m(β, γ)

indicates the number of k-mers γ at a Hamming distance of at most m from both α and β (I_m(α, γ) = 1 if the Hamming distance h(α, γ) ≤ m, and 0 otherwise).

Fig. 3. The SCOP hierarchy. Protein domains are organized into classes, folds, superfamilies, and families. Protein sequences from the same superfamily but different families are considered remote homologs.

1. http://ranger.uta.edu/~chqding/bioinfo.html.

The substitution kernel [6] replaces the Hamming distance-based indicator function I_m(α, β) with a scoring function

I_σ(α, β) = 1 if −Σ_{i=1}^{k} log p(α_i | β_i) < σ, and 0 otherwise

(p(α_i | β_i) is a conditional substitution probability).

The profile kernel [2],

K_p(X, Y) = Σ_{γ ∈ Σ^k} Σ_{i_X = 1}^{|X| − k + 1} Σ_{i_Y = 1}^{|Y| − k + 1} I_p(p_{i_X}(X), γ) I_p(p_{i_Y}(Y), γ),

uses a position-specific scoring function I_p(p_i, γ) over the 20 × |X| profile p(X) = p_1 ... p_|X|,

I_p(p_i, γ) = 1 if −Σ_{j=1}^{k} log p_{i+j−1}(γ_j) < σ, and 0 otherwise,

where p_i(γ_j) indicates the emission probability from the sequence profile p(X) at position i. Compared to the substitution kernel, the score I_p(·, ·) for two k-mers α and β depends on the positions of α and β within the sequences.

The spatial sample kernel [4] is defined over spatial features

a_1 →^{d_1} a_2 →^{d_2} ... →^{d_{t−1}} a_t,

representing all possible arrangements of t k-long substrings with the maximum distance between two consecutive substrings in the t-tuple constrained by the distance parameter d. The inclusion of spatial arrangement information among k-mers results in better performance compared to spectrum/mismatch kernels [4].

4.3 2D Representations for Protein Sequences

For remote homology and protein fold prediction tasks, we

use 20D BLOSUM amino acid substitution vectors for each

amino acid to obtain a 20� jXj 2D amino-acid sequence

representation for each 1D amino acid sequence X (i.e., we

replace each individual amino acid i in X with a 20D

vector bi;j; j ¼ 1 . . . j�j ¼ 20, of substitution scores for

amino acids i and j). We also use 5D quantitative

descriptors [32] derived from 237 physicochemical amino

acid properties. As we show in the experiments both of

these descriptors (20D BLOSUM and 5D quantitative

descriptors) improve predictive accuracy compared to 1Dsymbolic amino acid representations.

We also use 2D sequence profiles ($20 \times |X|$) and compare with the profile kernel approach [2].

4.4 2D Kernels for Protein Sequences

We use state-of-the-art spectrum/mismatch [6] and spatial (SSSK) [4] kernels as our basic 1D similarity kernels (i.e., we use spectrum/mismatch and SSSK similarity kernels as the row-matching function $I(\cdot,\cdot)$ in (7)). Depending on the choice of the 1D similarity kernel, we refer to the corresponding multivariate kernels as 2D-spectrum, 2D-mismatch, and 2D-SSSK multivariate kernels. We use typical settings for the kernel parameters and set $k = 5$, $m = 1$ for mismatch kernels, and $k = 1$, $t = 3$, $d = 5$ for SSSK kernels. The parameters $k$ and $\sigma$ for the profile kernel are set to 5 and 7.5, the best values found in previous studies [2].

For direct feature quantization, we use uniform quantization of each feature's data range into $B = 32$ bins (during testing, values outside the $(f_{\min}, f_{\max})$ range receive the special codes 0 and $B + 1$ for values smaller than $f_{\min}$ or larger than $f_{\max}$, respectively). For discrete embedding with similarity hashing, we use $E = 8$ bits.
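The two discretization schemes can be sketched as follows; this is an illustration, and the bin-edge convention and the random-hyperplane form of the similarity hashing are one plausible reading of the setup described above:

```python
def quantize(value, fmin, fmax, B=32):
    # Uniform quantization of a feature range (fmin, fmax) into B bins
    # (codes 1..B); out-of-range test values get reserved codes 0 and B+1.
    if value < fmin:
        return 0
    if value > fmax:
        return B + 1
    width = (fmax - fmin) / B
    return min(int((value - fmin) / width) + 1, B)  # right edge falls in bin B

def hamming_embed(vec, planes):
    # Similarity-preserving binary embedding: one bit per random hyperplane
    # (sign of the projection), giving an E = len(planes)-bit code.
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                 for plane in planes)
```

Under this reading, $E = 8$ corresponds to eight random hyperplanes.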

For all tasks, we use a kernel SVM [20] as the classifier. For multiclass problems, we use the one-versus-rest approach and train a binary classifier for each class.
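The one-versus-rest scheme can be sketched generically; here `fit_binary` stands in for training a kernel SVM, and all names are illustrative:

```python
def train_one_vs_rest(examples, labels, fit_binary):
    # For each class c, fit a binary classifier on +1 (class c) vs. -1 (rest);
    # fit_binary(examples, signs) must return a real-valued scoring function.
    return {c: fit_binary(examples, [1 if y == c else -1 for y in labels])
            for c in set(labels)}

def predict(models, x):
    # Predict the class whose binary scorer is most confident; sorting the
    # scores in nonincreasing order also yields the label ranking that the
    # top-q error measure is computed from.
    return max(models, key=lambda c: models[c](x))

# Toy stand-in for an SVM trainer: score by distance to the nearest positive.
def nearest_positive(examples, signs):
    pos = [e for e, s in zip(examples, signs) if s == 1]
    return lambda x: -min(abs(x - p) for p in pos)
```

Any binary trainer with real-valued outputs can be slotted in for `nearest_positive`.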

All experiments are performed on a single 2.8-GHz CPU. The data sets used in our experiments and the supplementary data/code are available at http://paul.rutgers.edu/~pkuksa/mvstring.html.

4.5 Evaluation Measures

For multiclass protein fold recognition tasks, the methods are evaluated using 0-1 and top-$q$ balanced error rates as well as F1 scores. Under the top-$q$ error cost function, a classification is considered correct if the rank of the correct label, obtained by sorting all prediction confidences in nonincreasing order, is at most $q$. Under the balanced error cost function, the penalty for misclassifying one sequence is inversely proportional to the number of sequences in the target class (i.e., misclassifying a sequence from a class with few examples incurs a higher penalty than misclassifying a sequence from a large, well-represented class). Balanced error rates are more indicative of performance on protein classification tasks because class sizes are unbalanced (they vary considerably), and the balanced error captures performance on all classes, not just the largest ones.
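Both measures are straightforward to compute; a minimal sketch (function names are ours):

```python
from collections import Counter

def balanced_error(y_true, y_pred):
    # Each misclassification is penalized inversely to its class size, so the
    # result equals the mean of the per-class error rates.
    sizes = Counter(y_true)
    err = sum(1.0 / sizes[t] for t, p in zip(y_true, y_pred) if t != p)
    return err / len(sizes)

def top_q_error(y_true, rankings, q):
    # rankings[i] lists candidate labels for example i sorted by nonincreasing
    # confidence; a prediction is correct if the true label ranks at most q.
    misses = sum(1 for t, r in zip(y_true, rankings) if t not in r[:q])
    return misses / len(y_true)
```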

We evaluate remote protein homology performance using standard receiver operating characteristic (ROC)

1206 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 10, NO. 5, SEPTEMBER/OCTOBER 2013

TABLE 1: Remote Homology and Fold Prediction Data Sets


and ROC50 scores. The ROC50 score is the (normalized) area under the ROC curve computed for up to 50 false positives. With a small number of positive test sequences and a large number of negative test sequences, the ROC50 score is typically more indicative of the prediction accuracy of a homology detection method than the ROC score.
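A compact way to compute ROC50 from binary labels and prediction scores (a sketch; ties in scores are ignored for simplicity):

```python
def roc50(labels, scores, max_fp=50):
    # Area under the ROC curve truncated at max_fp false positives,
    # normalized so that a perfect ranking scores 1.0; labels are 0/1.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = area = 0
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
            area += tp  # one column of height tp per false positive seen
            if fp == max_fp:
                break
    denom = sum(labels) * fp
    return area / denom if denom else 0.0
```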

4.6 Remote Homology Detection

We first compare our proposed multivariate string kernel method with a number of state-of-the-art kernel methods for remote homology detection, including spectrum/mismatch kernels [3], [11], spatial sample kernels (SSSK) [4], the semi-supervised cluster kernel [27], and the state-of-the-art profile kernel [2].

To obtain multivariate protein sequence representations, we use rows from the BLOSUM substitution matrix as amino acid feature vectors, i.e., all protein sequences are represented as $20 \times |X|$ multivariate (2D) sequences.

In Table 2, we compare, in terms of average ROC and ROC50 scores, univariate kernels (first column), multivariate (2D) kernels using discrete feature quantization (second column), and multivariate (2D) kernels using binary Hamming embedding (third column).

As can be seen from the results in Table 2, the multivariate string kernels provide effective improvements over other string kernel approaches. For instance, using 2D BLOSUM substitution profiles with the spectrum and mismatch kernels significantly improves average ROC50 scores from 27.91 and 41.92 to 43.29 and 49.17, respectively (relative improvements of 50 and 17 percent), compared to traditional (1D) spectrum/mismatch approaches.

Similar improvements are observed when using the spatial sample kernel (SSSK): the average ROC50 increases from 50.12 using 1D amino acid sequences to 55.54 using the 2D BLOSUM representation with the SSSK kernel, an 11 percent relative improvement.

We also observe that the multivariate (2D) kernel provides substantial improvements in semi-supervised settings using the semi-supervised cluster kernel [27] and profile kernel approaches. For example, the multivariate kernel on the sequence profiles used by the profile kernel (obtained from the nonredundant sequence database (NRDB) [2]) achieves a higher average ROC50 score of 86.27, compared to 81.51 for the profile kernel.

We also note that direct feature quantization provides more effective improvements than discrete embedding with similarity hashing (see Table 2). For example, using similarity hashing with the spectrum and SSSK kernels yields smaller improvements than direct feature quantization (average ROC50 scores of 39.88 versus 43.29, and 54.02 versus 55.54, respectively).

Table 3 shows, for each of the kernel methods (spectrum, mismatch, spatial sample, and profile), p-values of the Wilcoxon signed-rank test on the ROC50 scores (54 experiments) against the 1D kernels. As can be seen from the table, the observed improvements (see Table 2) are significant.

Table 5 summarizes the classification performance of the discrete (binary) embedding with similarity hashing as a function of the embedding size ($E = 8, 16, 32$ bits) and the kernel parameter ($k = 5, 8, 10$).

In Table 4, we also compare with the recently proposed spectrum-RBF and mismatch-RBF methods [12] that incorporate physicochemical descriptors with traditional


TABLE 2: Classification Performance (Mean ROC50) on Protein Remote Homology Detection (54 Experiments)

TABLE 3: Statistical Significance of the Differences between 1D (Amino Acid Sequence) Kernels and the Proposed Multivariate Kernels (Remote Homology Detection)

Multivariate similarity kernels perform better than the traditionally used 1D (univariate) kernels.

TABLE 4: Protein Remote Homology Detection

Comparison with generative and other state-of-the-art methods.

TABLE 5: Remote Homology Prediction

Classification performance (mean ROC50) as a function of the embedding size E and the kernel parameters.


spectrum/mismatch kernels, as well as generative model-based methods (the profile HMM from [34] and traditional SVM-Fisher methods [33]), and the recent sequence learning method SEQL [28].

We note that our multivariate similarity kernel using only BLOSUM substitution scores achieves higher average ROC50 scores (see Table 4) than the computationally more expensive spectrum-RBF/mismatch-RBF approaches [12], which exploit richer physicochemical descriptors (BLOSUM, AAindex descriptors, etc.).

Table 6 shows results on the second protein remote homology benchmark (SCOP 1.53) and compares the performance of the proposed 2D kernels (2D-Mismatch and 2D-SSSK) with the univariate string kernels (mismatch and SSSK), as well as recently developed SVM prediction methods that use physicochemical amino acid descriptors (SVM-PCD [14], SVM-RQA [13]). As shown in the table, the multivariate (2D) kernels display the highest ROC50 performance on this benchmark.

4.7 Protein Fold Detection

We compare performance on the SCOP 1.73 fold benchmark in Table 7. Similar to the improvements observed on protein remote homology detection (see Table 2), on this more challenging fold detection task the multivariate (2D) kernels (2D-Mismatch, 2D-SSSK) provide effective improvements over the corresponding univariate mismatch and SSSK kernels (e.g., the average ROC50 score increases from 24.34 to 27.64 when using the 2D-SSSK kernel instead of the univariate SSSK kernel).

4.8 Multiclass Protein Fold Prediction

For multiclass protein fold prediction (see Table 8; Ding and Dubchak data set, 27 folds), using the multivariate string kernel with BLOSUM profiles ($20 \times |X|$), we observe substantial improvements over the 1D mismatch kernel; e.g., the balanced error rate improves from 53.2 to 48.5 percent for the mismatch ($k = 5$, $m = 1$) kernel, a 9 percent relative improvement. We also note that the obtained error rates compare well with those of the computationally more expensive substitution kernel [6], which also uses BLOSUM substitution scores to measure similarity between $k$-mers.

On a challenging remote fold prediction data set [7] (results in Table 9), we observe similar improvements in ranking quality when using the multivariate kernel with BLOSUM profiles over the corresponding string kernel methods that use 1D amino acid sequences. For instance, the 28.92 percent top-five error rate of the cluster kernel with the BLOSUM profile compares well with the 35.28 percent error rate of the state-of-the-art profile kernel.

4.9 MHC Binding Prediction

We test MHC binding prediction performance on three MHC alleles (A*2301, B*5801, A*0201) with small (104 peptides), moderate (988 peptides), and large (3,089 peptides) numbers of binding peptides. Classification performance (binding versus nonbinding) is evaluated using average ROC scores over fivefold cross-validation (the cross-validation folds are the same as in [30]). Table 10 compares the performance with the traditional spectrum and weighted degree (WD) kernels [35] that use position-specific matching. As the results show, using multivariate (2D) representations improves performance over 1D representations (spectrum and WD). The observed improvements are more


TABLE 6: Comparison with Univariate String Kernels and SVM Physicochemical Properties-Based Predictors on the SCOP 1.53 Benchmark

TABLE 7: Protein Remote Fold Detection Performance on the SCOP 1.73 Fold Benchmark

TABLE 8: Multiclass Protein Fold Prediction [29] (27-Class)

TABLE 9: Classification Performance on Fold Prediction (Multiclass) [7]


significant for the small and moderate MHC data sets (A*2301, B*5801); e.g., using the multivariate BLOSUM representation achieves a higher average ROC score of 83.28, compared to 78.63 for the 1D spectrum kernel.

4.10 Running Time

In Table 11, we compare the running times of the proposed multivariate string kernels and traditional univariate string kernel methods. We note that for the mismatch-(k,m) kernel computation (protein remote homology data), we use the linear-time sufficient-statistic-based algorithms from [10]. As the results show, the multivariate similarity kernels have running times similar to those of traditional 1D (univariate) kernels while delivering better classification performance (e.g., Table 2).

5 CONCLUSIONS

We presented new multivariate string kernel methods for biological sequence classification that exploit richer multivariate feature sequence representations (in particular, biological sequence profiles and sequences of amino acid descriptors). The proposed approach directly exploits these multivariate (2D) feature sequences to improve sequence classification. On three protein sequence classification tasks, it shows significant 15-20 percent improvements compared to state-of-the-art sequence classification methods.

REFERENCES

[1] J. Cheng and P. Baldi, "A Machine Learning Information Retrieval Approach to Protein Fold Recognition," Bioinformatics, vol. 22, no. 12, pp. 1456-1463, June 2006.

[2] R. Kuang, E. Ie, K. Wang, K. Wang, M. Siddiqi, Y. Freund, and C.S. Leslie, "Profile-Based String Kernels for Remote Homology Detection and Motif Extraction," Proc. IEEE Computational Systems Bioinformatics Conf. (CSB '04), pp. 152-160, 2004.

[3] C.S. Leslie, E. Eskin, J. Weston, and W.S. Noble, "Mismatch String Kernels for SVM Protein Classification," Proc. Neural Information Processing Systems Conf., pp. 1417-1424, 2002.

[4] P.P. Kuksa and V. Pavlovic, "Spatial Representation for Efficient Sequence Classification," Proc. 20th Int'l Conf. Pattern Recognition (ICPR '10), 2010.

[5] S. Sonnenburg, G. Ratsch, and B. Scholkopf, "Large Scale Genomic Sequence SVM Classifiers," Proc. 22nd Int'l Conf. Machine Learning (ICML '05), pp. 848-855, 2005.

[6] C. Leslie and R. Kuang, "Fast String Kernels Using Inexact Matching for Protein Sequences," J. Machine Learning Research, vol. 5, pp. 1435-1455, 2004.

[7] I. Melvin, E. Ie, J. Weston, W.S. Noble, and C. Leslie, "Multi-Class Protein Classification Using Adaptive Codes," J. Machine Learning Research, vol. 8, pp. 1557-1581, 2007.

[8] P. Kuksa and V. Pavlovic, "Efficient Alignment-Free DNA Barcode Analytics," BMC Bioinformatics, vol. 10, suppl. 14, article S9, 2009.

[9] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.

[10] P. Kuksa, P.-H. Huang, and V. Pavlovic, "Scalable Algorithms for String Kernels with Inexact Matching," Proc. Neural Information Processing Systems Conf. (NIPS '08), 2008.

[11] C.S. Leslie, E. Eskin, and W.S. Noble, "The Spectrum Kernel: A String Kernel for SVM Protein Classification," Proc. Pacific Symp. Biocomputing, pp. 566-575, 2002.

[12] N. Toussaint, C. Widmer, O. Kohlbacher, and G. Ratsch, "Exploiting Physico-Chemical Properties in String Kernels," BMC Bioinformatics, vol. 11, suppl. 8, article S7, 2010.

[13] Y. Yang, E. Tantoso, and K.-B. Li, "Remote Protein Homology Detection Using Recurrence Quantification Analysis and Amino Acid Physicochemical Properties," J. Theoretical Biology, vol. 252, no. 1, pp. 145-154, 2008.

[14] B.-J. Webb-Robertson, K. Ratuiste, and C. Oehmen, "Physicochemical Property Distributions for Accurate and Rapid Pairwise Protein Homology Detection," BMC Bioinformatics, vol. 11, no. 1, article 145, 2010.

[15] R.D. King, A. Karwath, A. Clare, and L. Dehaspe, "The Utility of Different Representations of Protein Sequence for Predicting Functional Class," Bioinformatics, vol. 17, no. 5, pp. 445-454, 2001.

[16] C.Z. Cai, L.Y. Han, Z.L. Ji, X. Chen, and Y.Z. Chen, "SVM-Prot: Web-Based Support Vector Machine Software for Functional Classification of a Protein from Its Primary Sequence," Nucleic Acids Research, vol. 31, pp. 3692-3697, 2003.

[17] C.S. Ong and A. Zien, "An Automated Combination of Kernels for Predicting Protein Subcellular Localization," Proc. Eighth Int'l Workshop Algorithms in Bioinformatics (WABI '08), pp. 186-197, 2008.

[18] T. Hertz and C. Yanover, "PepDist: A New Framework for Protein-Peptide Binding Prediction Based on Learning Peptide Distance Functions," BMC Bioinformatics, vol. 7, suppl. 1, article S3, 2006.

[19] N. Pfeifer and O. Kohlbacher, "Multiple Instance Learning Allows MHC Class II Epitope Predictions across Alleles," Proc. Eighth Int'l Workshop Algorithms in Bioinformatics (WABI '08), pp. 210-221, 2008.

[20] V.N. Vapnik, Statistical Learning Theory. John Wiley & Sons, 1998.

[21] C. Cortes, P. Haffner, and M. Mohri, "Rational Kernels: Theory and Algorithms," J. Machine Learning Research, vol. 5, pp. 1035-1062, 2004.

[22] S.V.N. Vishwanathan and A. Smola, "Fast Kernels for String and Tree Matching," Proc. Neural Information Processing Systems Conf. (NIPS '02), 2002.

[23] M. Gribskov, A. McLachlan, and D. Eisenberg, "Profile Analysis: Detection of Distantly Related Proteins," Proc. Nat'l Academy of Sciences, vol. 84, pp. 4355-4358, 1987.

[24] L.J. Jensen, R. Gupta, N. Blom, D. Devos, J. Tamames, C. Kesmir, H. Nielsen, H.H. Staerfelt, K. Rapacki, C. Workman, C.A. Andersen, S. Knudsen, A. Krogh, A. Valencia, and S. Brunak, "Prediction of Human Protein Function from Post-Translational Modifications and Localization Features," J. Molecular Biology, vol. 319, no. 5, pp. 1257-1265, 2002.

[25] I. Dubchak, I. Muchnik, S.R. Holbrook, and S.H. Kim, "Prediction of Protein Folding Class Using Global Description of Amino Acid Sequence," Proc. Nat'l Academy of Sciences USA, vol. 92, no. 19, pp. 8700-8704, 1995.

[26] Y. Weiss, A. Torralba, and R. Fergus, "Spectral Hashing," Proc. Advances in Neural Information Processing Systems, pp. 1753-1760, 2009.

[27] J. Weston, C. Leslie, E. Ie, D. Zhou, A. Elisseeff, and W.S. Noble, "Semi-Supervised Protein Classification Using Cluster Kernels," Bioinformatics, vol. 21, no. 15, pp. 3241-3247, 2005.

[28] G. Ifrim and C. Wiuf, "Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space," Proc. 17th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '11), pp. 708-716, 2011.


TABLE 11: Running Time for the Kernel Computations

TABLE 10: Classification Performance (ROC Scores) on MHC-Peptide Binding Prediction


[29] C.H. Ding and I. Dubchak, "Multi-Class Protein Fold Recognition Using Support Vector Machines and Neural Networks," Bioinformatics, vol. 17, no. 4, pp. 349-358, 2001.

[30] B. Peters, H.-H. Bui, S. Frankild, M. Nielsen, C. Lundegaard, E. Kostem, D. Basch, K. Lamberth, M. Harndahl, W. Fleri, S.S. Wilson, J. Sidney, O. Lund, S. Buus, and A. Sette, "A Community Resource Benchmarking Predictions of Peptide Binding to MHC-I Molecules," PLoS Computational Biology, vol. 2, no. 6, article e65, 2006.

[31] L. Lo Conte, B. Ailey, T. Hubbard, S. Brenner, A. Murzin, and C. Chothia, "SCOP: A Structural Classification of Proteins Database," Nucleic Acids Research, vol. 28, no. 1, pp. 257-259, 2000.

[32] M.S. Venkatarajan and W. Braun, "New Quantitative Descriptors of Amino Acids Based on Multidimensional Scaling of a Large Number of Physical-Chemical Properties," J. Molecular Modeling, vol. 7, pp. 445-453, 2001.

[33] T. Jaakkola, M. Diekhans, and D. Haussler, "Using the Fisher Kernel Method to Detect Remote Protein Homologies," Proc. Seventh Int'l Conf. Intelligent Systems for Molecular Biology, pp. 149-158, 1999.

[34] P. Huang and V. Pavlovic, "Protein Homology Detection with Biologically Inspired Features and Interpretable Statistical Models," Int'l J. Data Mining and Bioinformatics, vol. 2, no. 2, pp. 157-175, June 2008.

[35] G. Raetsch and S. Sonnenburg, "Accurate Splice Site Detection for Caenorhabditis Elegans," pp. 277-298, MIT Press, 2004.

Pavel P. Kuksa received the bachelor's degree with honors in computer engineering in 2002 and the master's degree with honors in information and computer sciences in 2005, both from the Bauman Moscow State Technical University, and the PhD degree from the Department of Computer Science, Rutgers University, in 2011. His research has always been focused on advancing the state of the art in both the theory and practice of machine learning algorithms, algorithms for sequence and time series analysis, bioinformatics, natural language processing, and large-scale data analysis.

