


A Survey of Kernels for Structured Data

Thomas Gärtner
Fraunhofer Institut Autonome Intelligente Systeme, Germany;
Department of Computer Science, University of Bristol, United Kingdom; and
Department of Computer Science III, University of Bonn, Germany
thomas.gaertner@ais.fraunhofer.de

ABSTRACT
Kernel methods in general and support vector machines in particular have been successful in various learning tasks on data represented in a single table. Much 'real-world' data, however, is structured: it has no natural representation in a single table. Usually, to apply kernel methods to 'real-world' data, extensive pre-processing is performed to embed the data into a real vector space and thus into a single table. This survey describes several approaches to defining positive definite kernels on structured instances directly.

Keywords
Kernel methods, structured data, multi-relational data mining, inductive logic programming

1. INTRODUCTION
Most 'real-world' data has no natural representation as a single table, yet most data mining research has so far concentrated on discovering knowledge from single tables. In order to apply 'traditional' data mining methods to structured data, extensive pre-processing has to be performed. Research in inductive logic programming and multi-relational data mining [8] aims to reduce these pre-processing efforts by considering learning from multi-relational data descriptions directly. Some algorithms developed in that field are upgrades of algorithms originally developed with only single-table data in mind. Among the algorithms upgraded so far are popular data mining methods such as decision trees, rule learners, and distance-based algorithms.

Support vector machines [3; 41] are among the most successful recent developments within the machine learning and data mining communities. Along with some other learning algorithms like Gaussian processes and kernel principal component analysis, they form the class of kernel methods [32; 37]. The computational attractiveness of kernel methods comes from the fact that they can be applied in high dimensional feature spaces without suffering the high cost of explicitly computing the mapped data. The kernel trick is to define a positive definite kernel on any set. For such functions it is known that there exists an embedding of the set in a linear space such that the kernel on the elements of the set corresponds to the inner product in this space.

While the inductive logic programming community has traditionally used logic programs to represent structured data, the scope has now extended and also includes other knowledge representation languages. Development of kernels for structured data has mostly been motivated and guided by 'real-world' problems. Although the structure of these problems is often such that they do not permit a natural representation in a single table, the full power of logic programs is hardly ever needed. This is one reason why kernel design has concentrated on different and sometimes less powerful knowledge representations.

This paper is not intended as an introduction to kernel-based learning algorithms; for such an introduction the reader is referred to one of the excellent books [5; 37] or tutorials [2; 32] on kernel methods ([2] is available online). Instead, this paper intends to give an introduction to kernel functions defined on structured data. For that, we assume some basic familiarity with linear algebra.

In this paper we distinguish kernels according to whether they are defined on the structure of the instances (syntax-driven kernels) or on the structure of the instance space (model-driven kernels). Furthermore, we distinguish according to the power of the knowledge representation used. The outline of the paper is as follows: Section 2 introduces kernel functions and characterizes valid and good kernels. It also presents a classification scheme for kernel functions. Model-driven kernels are then described in section 3. After that, syntax-driven kernels are described in section 4. Finally, section 5 concludes.

2. KERNEL FUNCTIONS
Two components of kernel methods have to be distinguished: the kernel machine and the kernel function. While the kernel machine encapsulates the learning task and the way in which a solution is looked for, the kernel function encapsulates the hypothesis language, i.e., how the set of possible solutions is made up. Different kernel functions implement different hypothesis spaces or even different knowledge representations.

2.1 Kernels on Vectors
Before we give the definition of positive definite kernels, the traditionally used kernels (on vector spaces) are briefly reviewed in this section. Let x, x' ∈ R^n and let (·, ·) denote the scalar product in R^n. Apart from the linear kernel

k(x, x') = (x, x')

and the normalized linear kernel

k(x, x') = \frac{(x, x')}{\sqrt{(x, x)(x', x')}}


which corresponds to the cosine of the angle enclosed by the vectors, the two most frequently used kernels on vector spaces are the polynomial kernel and the Gaussian kernel. Given two parameters l ∈ R, p ∈ N+, the polynomial kernel is defined as:

k(x, x') = ((x, x') + l)^p

The intuition behind this kernel definition is that it is often useful to construct new features as products of original features. This way, for example, the XOR problem can be turned into a linearly separable problem. The p in the above definition is the maximal order of the monomials making up the new feature space, and l is a bias towards lower order monomials. If l = 0 the feature space consists only of monomials of order p of the original features.
EXAMPLE: Consider the positive examples (+1, +1), (-1, -1) and the negative examples (+1, -1), (-1, +1). Clearly, there is no straight line separating positive from negative examples. However, if we use the transformation φ : (x1, x2) → (x1^2, \sqrt{2} x1 x2, x2^2), separation is possible with the plane orthonormal to (0, 1, 0), as the sign of x1 x2 already corresponds to the class. To see that this transformation can be performed implicitly by a polynomial kernel, let x = (x1, x2), z = (z1, z2) and k(x, z) = (x, z)^2. Then

k(x, z) = ((x1, x2), (z1, z2))^2 = (x1 z1 + x2 z2)^2
        = (x1 z1)^2 + 2 x1 x2 z1 z2 + (x2 z2)^2
        = ((x1^2, \sqrt{2} x1 x2, x2^2), (z1^2, \sqrt{2} z1 z2, z2^2))
        = (φ(x), φ(z))

Now, let x+ (x-) be either of the positive (negative) examples given above. Any example x can be classified without explicitly transforming instances, as k(x, x+) - k(x, x-) corresponds to implicitly projecting x onto the vector φ(x+) - φ(x-) = (0, 2\sqrt{2}, 0).
Given the bandwidth parameter σ, the Gaussian kernel is defined as:

k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)

Using this kernel function in a support vector machine can be seen as using a radial basis function network with Gaussian kernels centered at the support vectors. The images of the points from the vector space R^n under the map φ : R^n → H with k(x, x') = (φ(x), φ(x')) all lie on the surface of a hyperball in the Hilbert space H. No two images are orthogonal and any set of images is linearly independent. The parameter σ can be used to tune how much generalization is done. For very high σ, all vectors φ(x) are almost parallel and thus almost identical. For very small σ, the vectors φ(x) are almost orthogonal to each other and the Gaussian kernel behaves almost like the matching kernel k_δ with k_δ(x, x') = 1 if x = x' and k_δ(x, x') = 0 if x ≠ x'. In applications this often causes a problem known as the ridge problem, which means that the learning algorithm functions more or less just as a lookup table.
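To make these vector-space kernels concrete, the following minimal sketch (in Python/NumPy; all function names and the toy data are illustrative, not taken from the survey) implements the polynomial and Gaussian kernels and checks, on the XOR example, that the degree-2 polynomial kernel with l = 0 agrees with the explicit feature map φ:

```python
import numpy as np

def polynomial_kernel(x, z, l=0.0, p=2):
    # k(x, z) = ((x, z) + l)^p
    return (np.dot(x, z) + l) ** p

def gaussian_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / (2 * sigma ** 2))

def phi(x):
    # explicit feature map of the degree-2 polynomial kernel with l = 0
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

# XOR example: positives have x1 * x2 = +1, negatives have x1 * x2 = -1
x_pos, x_neg = np.array([1.0, 1.0]), np.array([1.0, -1.0])
for x in [np.array([-1.0, -1.0]), np.array([-1.0, 1.0])]:
    implicit = polynomial_kernel(x, x_pos) - polynomial_kernel(x, x_neg)
    explicit = np.dot(phi(x), phi(x_pos) - phi(x_neg))
    assert np.isclose(implicit, explicit)
    print(x, "classified as", "+" if implicit > 0 else "-")
```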

2.2 Valid Kernels
Technically, a kernel k corresponds to the inner product in some feature space which is, in general, different from the representation space of the instances. The computational attractiveness of kernel methods comes from the fact that quite often a closed form of these 'feature space inner products' exists. Instead of performing the expensive transformation step explicitly, the kernel can be calculated directly, thus performing the feature transformation only implicitly. Whether, for a given function k : X × X → R, a feature transformation φ : X → H into the Hilbert space H exists such that k(x, x') = (φ(x), φ(x')) for all x, x' ∈ X can be checked by verifying that the function is positive definite [1]. This means that any set, whether a linear space or not, that admits a positive definite kernel can be embedded into a linear space. Throughout the paper, we take 'valid' to mean 'positive definite'. Here then is the definition of a positive definite kernel. (Z+ is the set of positive integers.)
DEFINITION: Let X be a set. A symmetric function k : X × X → R is a positive definite kernel on X if, for all n ∈ Z+, x1, ..., xn ∈ X, and c1, ..., cn ∈ R, it follows that

\sum_{i,j=1}^{n} c_i c_j k(x_i, x_j) \geq 0.

While it is not always easy to prove positive definiteness for a given kernel, positive definite kernels do have nice closure properties. In particular, they are closed under sum, direct sum, multiplication by a scalar, product, tensor product, zero extension, pointwise limits, and exponentiation [5].
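Positive definiteness cannot in general be established numerically, but a Gram matrix with a negative eigenvalue does refute it. A small sketch under that caveat (the kernels and sample points below are arbitrary illustrative choices):

```python
import numpy as np

def gram_matrix(kernel, xs):
    # K[i, j] = k(x_i, x_j) for all pairs of sample points
    return np.array([[kernel(a, b) for b in xs] for a in xs])

def could_be_positive_definite(kernel, xs, tol=1e-10):
    # the definition requires sum_ij c_i c_j k(x_i, x_j) >= 0, i.e. every
    # Gram matrix must have no negative eigenvalues
    K = gram_matrix(kernel, xs)
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

rng = np.random.default_rng(0)
points = list(rng.normal(size=(20, 3)))
gaussian = lambda a, b: np.exp(-np.sum((a - b) ** 2))
neg_sq_distance = lambda a, b: -np.sum((a - b) ** 2)  # not positive definite
print(could_be_positive_definite(gaussian, points))         # True
print(could_be_positive_definite(neg_sq_distance, points))  # False
```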

2.3 Good Kernels
For a kernel method to perform well on some domain, validity of the kernel is not the only issue. To discuss the characteristics of good kernels, we need the notion of a concept class. Concepts c are functions c : X → Ω, where X is often referred to as the instance space or problem domain (examples are elements of this domain) and Ω is the set of Boolean labels. A concept class C is a set of concepts.
While there is always a valid kernel that performs poorly (k_0(x, x') = 0), there is also always a valid kernel (k_c(x, x') = +1 if c(x) = c(x') and k_c(x, x') = -1 if c(x) ≠ c(x')) that performs ideally. We distinguish the following three issues crucial to 'good' kernels: completeness, correctness, and appropriateness.
Completeness refers to the extent to which the knowledge incorporated in the kernel is sufficient for solving the problem at hand. A kernel is said to be complete if it takes into account all the information necessary to represent the concept that underlies the problem domain. Formally, we call a kernel complete if k(x, ·) = k(x', ·) implies x = x'¹. With respect to some concept class, however, it is not important to distinguish between instances that are equally classified by all concepts in that particular concept class. We call a kernel complete with respect to a concept class C if k(x, ·) = k(x', ·) implies c(x) = c(x') for all c ∈ C.
Correctness refers to the extent to which the underlying semantics of the problem are obeyed in the kernel. Correctness can formally only be expressed with respect to a certain concept class and a certain hypothesis language. In particular, we call a kernel correct with respect to a concept class C and support vector machines if for all concepts c ∈ C we can find α_i ∈ R, x_i ∈ X, θ ∈ R such that ∀x ∈ X : Σ_i α_i k(x_i, x) ≥ θ ⇔ c(x). For the remainder of the paper we will use 'correct' only with respect to support vector machines and other kernel methods using a similar hypothesis language.

¹This corresponds to the map φ, with (φ(x), φ(x')) = k(x, x') for all x, x', being injective.


Appropriateness refers to the extent to which examples that are close to each other in class membership are also 'close' to each other in feature space. Appropriateness can formally only be defined with respect to a concept class and a learning algorithm. A kernel is appropriate for learning concepts in a concept class if polynomial mistake bounds can be derived for some algorithm using this kernel. This problem is most apparent in the matching kernel k_δ(x, x') = 1 if x = x' and k_δ(x, x') = 0 if x ≠ x', which is always complete and correct but (in general) not appropriate.
Empirically, a complete, correct and appropriate kernel exhibits two properties. A complete and correct kernel separates the concept well, i.e., a learning algorithm achieves high accuracy when learning and validating on the same part of the data. An appropriate kernel generalizes well, i.e., a learning algorithm achieves high accuracy when learning and validating on different parts of the data.

2.4 Classes of Kernels
A useful conceptual distinction between different kernels is based on the 'driving-force' of their definition. We distinguish between syntax and models as the driving-force of the kernel definition.
Syntax is often used in typed systems to formally describe the semantics of the data. It is the most common driving force. In its simplest case, i.e., untyped attribute-value representations, it treats every attribute in the same way. More complex syntactic representations are graphs, restricted subsets of graphs such as lists and trees, or terms in some (possibly typed) logic. Whenever kernels are syntax-driven, they are either special case kernels, assuming some underlying semantics of the application, or they are parameterized and offer the possibility to adapt the kernel function to certain semantics.
Models contain some kind of knowledge about the instance space, i.e., about the relationships among instances. These models can either be generative models of instances or they can be given by some sort of transformation relations. Hidden Markov models are a frequently used generative model. Edit operations on lists are one example of operations that transform one list into another. The graph defined on the set of lists by these operations can be seen as a model of the instance space. While each edge of a graph only contains local information about neighboring vertices, the set of all edges, i.e., the graph itself, also contains information about the global structure of the instance space.

3. MODEL-DRIVEN KERNELS
This section describes different kernel functions defined on models describing the instance space. These models are either constructed from background knowledge about the semantics of the domain at hand or learned from data. The first part of this section deals with kernel functions defined on probabilistic, generative models of the instance space. The second part of this section describes kernel functions defined using some kind of similarity relation or transformation operation between instances.

3.1 Kernels from Generative Models
Generative models in general and hidden Markov models [35] in particular are widely used in computer science. One of their application areas is protein fold recognition, where one tries to understand how proteins fold up in nature. Another application area is speech recognition. One motivation behind the development of kernels on generative models is to be able to apply kernel methods to sequence data. Sequences occur frequently in nature; for example, proteins are sequences of amino acids, genes are sequences of nucleic acids, and spoken words are sequences of phonemes. Another motivation is to improve the classification accuracy of generative models.
The first and most prominent kernel function based on a generative model is the Fisher kernel [17; 18]. The key idea is to use the gradient of the log-likelihood with respect to the parameters of a generative model as the features in a discriminative classifier. The motivation to use this feature space is that the gradient of the log-likelihood with respect to the parameters of a generative model captures the generative process of a sequence rather than just the posterior probabilities.
Let U_x be the gradient of the log-likelihood with respect to the parameters of the generative model P(x|θ) at x:

U_x = \nabla_\theta \log P(x|\theta)

Furthermore, let I be the Fisher information matrix, i.e., the expected value of the outer product U_x U_x^T over P(x|θ). The Fisher kernel is then defined as k(x, x') = U_x^T I^{-1} U_{x'}. The Fisher kernel can be calculated whenever the probability model P(x|θ) of the instances given the parameters of the model has a twice differentiable likelihood and the Fisher information matrix is positive definite at the chosen θ. Learning algorithms using the Fisher kernel can be shown to perform well if the class variable is contained as a latent variable in the probability model. In [17] it has been shown that under this condition kernel machines using the Fisher kernel are asymptotically at least as good as choosing the maximum a posteriori class for each instance based on the model. In practice, the role of the Fisher information matrix is often ignored, yielding the kernel k(x, x') = U_x^T U_{x'}.
Usually, a hidden Markov model is used as the generative model and a support vector machine as the discriminative classifier. The Fisher kernel has successfully been applied in many learning problems where the instances are sequences of symbols, such as protein classification [16; 20] and promoter region detection [34].
The key ingredient of the Fisher kernel is the Fisher score mapping U_x that extracts a feature vector from a generative model. In [39] performance measures for comparing such feature extractors are discussed. Based on this discussion, a kernel is defined on models where the class is an explicit variable in the generative model, rather than only a latent variable as in the Fisher kernel. Empirically, this kernel performs favorably compared to the Fisher kernel on a protein fold recognition task. A similar approach has been applied to speech recognition [38].
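As a toy illustration of the practical variant k(x, x') = U_x^T U_{x'}, the sketch below uses a simple categorical 'bag of symbols' model P(x|θ) ∝ Π_a θ_a^{count_a(x)}, for which the Fisher score has the closed form [U_x]_a = count_a(x)/θ_a (the sum-to-one constraint on θ is ignored). The model choice and names are hypothetical illustrations, not the hidden Markov models used in [17]:

```python
import numpy as np

ALPHABET = "ACGT"

def fisher_score(x, theta):
    # U_x = grad_theta log P(x|theta) for a categorical 'bag of symbols' model:
    # log P(x|theta) = sum_a count_a(x) log theta_a, so [U_x]_a = count_a(x) / theta_a
    counts = np.array([x.count(a) for a in ALPHABET], dtype=float)
    return counts / theta

def fisher_kernel(x, x2, theta):
    # practical Fisher kernel: the Fisher information matrix is replaced by the identity
    return float(fisher_score(x, theta) @ fisher_score(x2, theta))

theta = np.array([0.25, 0.25, 0.25, 0.25])  # model parameters (here simply uniform)
print(fisher_kernel("AACGT", "ACGTT", theta))
print(fisher_kernel("AAAAA", "GGGGG", theta))  # no shared symbols, so the kernel is 0
```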


Recently, a general framework for defining kernel functions on generative models has been described [40]. The so-called 'marginalized kernels' contain the above described Fisher kernel as a special case. The paper compares other marginalized kernels with the Fisher kernel and argues that these have some advantages over the Fisher kernel. While, for example, Fisher kernels only allow for the incorporation of second-order information by using a second-order hidden Markov model [7], other marginalized kernels allow for the use of second-order information with a first-order hidden Markov model. In [40] it is shown that incorporating this second-order information in a kernel function is useful in the prediction of bacterial genera from their DNA.
In general, marginalized kernels are defined on any generative model with some visible and some hidden information. Let the visible information be an element of the finite set X and the hidden information be an element of the finite set S. If the hidden information were known, a joint kernel k_z : (X × S) × (X × S) → R could be used. Usually, the hidden information is unknown, but the expectation of a joint kernel with respect to the hidden information can be used. Let x, x' ∈ X and s, s' ∈ S. Given a joint kernel k_z and the posterior distribution p(s|x) (usually estimated from a generative model), the marginalized kernel on X is defined as:

k(x, x') = \sum_{s, s' \in S} p(s|x) \, p(s'|x') \, k_z((x, s), (x', s'))
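Spelled out in code, the marginalization is just a double sum over hidden states. The posterior and joint kernel below are toy stand-ins for quantities that would normally come from a trained generative model:

```python
def marginalized_kernel(x, y, posterior, joint_kernel, hidden_states):
    # k(x, x') = sum over s, s' of p(s|x) p(s'|x') k_z((x, s), (x', s'))
    return sum(posterior(s, x) * posterior(t, y) * joint_kernel((x, s), (y, t))
               for s in hidden_states for t in hidden_states)

# toy stand-ins: two hidden states, a uniform posterior, and a joint kernel
# that requires both the hidden states and the visible symbols to match
hidden_states = [0, 1]
posterior = lambda s, x: 0.5
joint_kernel = lambda a, b: float(a[1] == b[1] and a[0] == b[0])
print(marginalized_kernel('A', 'A', posterior, joint_kernel, hidden_states))  # 0.5
print(marginalized_kernel('A', 'B', posterior, joint_kernel, hidden_states))  # 0.0
```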

To complete this section on kernels from generative models, the idea of defining kernel functions between sequences based on a pair hidden Markov model [7] has to be mentioned. Such kernels have been developed in [15] and [45]. Strictly speaking, pair hidden Markov models are not models of the instance space; they are generative models of an aligned pair of sequences [7].
Recently, [36] has shown that syntactic string kernels (presented in section 4.2) can be seen as a special case of Fisher kernels of a k-stage Markov process with uniform distributions over the transitions.

3.2 Kernels from Transformations
This section describes kernels that are based on knowledge about common properties of instances or transformations between instances. The best known kernel in this class is the diffusion kernel.
The motivation behind diffusion kernels [24] is that it is often easier to describe the local neighborhood of an instance than to describe the structure of the whole instance space or to compute a similarity between two arbitrary instances. The neighborhood of an instance might be all instances that differ from it only by the presence or absence of one particular property. When working with molecules, for example, such properties might be functional groups or bonds. The neighborhood relation obviously induces global information about the make-up of the instance space. The approach taken in the diffusion kernel is to try to capture this global information in a kernel function merely based on the neighborhood description.
The main mathematical tool used in diffusion kernels is matrix exponentiation. The exponential of a square matrix H is defined as

e^{\beta H} = \lim_{n \to \infty} \sum_{i=0}^{n} \frac{(\beta H)^i}{i!}

It is known that the limit always exists and that e^{\beta H} is a positive definite matrix if H is symmetric. In this case it is also possible to compute e^{\beta H} efficiently by first diagonalizing the matrix H such that H = T^{-1} D T and then computing

e^{\beta H} = T^{-1} e^{\beta D} T

where e^{\beta D} can be computed component-wise (as D is diagonal).
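Assuming H is symmetric, this diagonalization translates directly into a few lines of NumPy. The example generator below is an illustrative choice (it happens to be the negative Laplacian of a three-vertex path, the kind of generator discussed next):

```python
import numpy as np

def matrix_exponential_sym(H, beta=1.0):
    # exp(beta H) for symmetric H via diagonalization: with H = T D T^T
    # (orthonormal eigenvectors in the columns of T), exp(beta H) = T exp(beta D) T^T
    eigvals, T = np.linalg.eigh(H)
    return T @ np.diag(np.exp(beta * eigvals)) @ T.T

H = np.array([[-1.0,  1.0,  0.0],
              [ 1.0, -2.0,  1.0],
              [ 0.0,  1.0, -1.0]])  # negative Laplacian of a three-vertex path
K = matrix_exponential_sym(H, beta=0.5)
print(np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) > 0))  # symmetric, positive definite
```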

The matrix H is called the 'generator'. The kernel matrix is defined as the exponential of the generator. In the case of instance spaces that have undirected graph structure, [24] suggests using the negative Laplacian of the graph as the generator. Let G = (V, E) be an undirected graph with vertices V = {v_i} and edges E ⊆ 2^V. In particular, {v_i, v_j} ∈ E if there is an edge between vertices v_i and v_j. Furthermore, let δ(v_i) = {v_j ∈ V : {v_i, v_j} ∈ E}.

Then the negative Laplacian of the graph is given by

[H]_{i,j} =
\begin{cases}
1 & \{v_i, v_j\} \in E \\
-|\delta(v_i)| & v_i = v_j \\
0 & \text{otherwise}
\end{cases}

where [H]_{i,j} denotes the i, j-th component of the generator matrix. To generalize this to the case of graphs with weighted and/or parallel edges, the components [H]_{i,j} for v_i ≠ v_j are replaced by the sum over the weights of all edges between v_i and v_j.
If the instance space is big, the computation of e^{\beta H} suggested above might still be too expensive. For some special instance space structures, such as regular trees, complete graphs, and closed chains, [24] gives closed forms for directly computing the exponential of the generator and thus the kernel matrix. An application of the diffusion kernel to gene function prediction has been described in [43].
A similar idea of defining a kernel function on the structure of the instance space is described in [42]. In that paper it is described how global patterns of inheritance can be incorporated in a kernel function and how missing information can be dealt with using a probabilistic model of inheritance. The main difference to the diffusion kernel is that the structures considered are directed trees and that instances correspond to sets of vertices in these trees rather than to single vertices. Trees are connected acyclic graphs where one vertex has no incoming edge and all other vertices have exactly one incoming edge. The application considered in that paper is that of classifying phylogenetic profiles. A phylogenetic profile contains information about the organisms in which a particular gene occurs. The phylogenetic information considered in [42] is represented as a tree such that each leaf (a vertex with no outgoing edge) corresponds to one living organism and every other vertex corresponds to some ancestor of the living organisms. To represent genes, every vertex of the tree is assigned a random variable. The value of this random variable indicates whether the corresponding organism has a homologue of the particular gene or not.
If the genomes of all ancestor organisms were available, this information could be used to define a kernel function reflecting the similarity between evolutions. A subtree of the above described tree along with an assignment of indicator values to this tree is called an evolution pattern. Ideally, the kernel function on two genes would be defined as the number of evolution patterns that agree with the phylogenetic histories of both genes. An evolution pattern agrees with a phylogenetic history if the assignment of indicator variables is the same for all vertices in the evolution pattern. As the genomes are only known for some ancestor organisms, a probabilistic model that can be used to estimate missing indicator variables is suggested in [42].


For two given phylogenetic profiles x_L, y_L the kernel is defined as follows. Let T be a tree, L the set of leaf vertices, and C(T) the set of all possible subtrees of T. One particular evolution pattern can be expressed as a subtree S ∈ C(T) and the corresponding assignment z_S of indicator values. p(z_S) denotes the probability of such an evolution pattern having occurred in nature, and p(x_L|z_S) is the probability of observing a particular phylogenetic profile x_L given the evolution pattern z_S (obtained from the probabilistic model mentioned above). Then the tree kernel for phylogenetic profiles is defined as:

k(x_L, y_L) = \sum_{S \in C(T)} \sum_{z_S} p(z_S) \, p(x_L | z_S) \, p(y_L | z_S)

In [42] it is shown that this kernel function can be computed in time linear in the size of the tree.

4. SYNTAX-DRIVEN KERNELS
This section describes different kernel definitions based on the syntax of the representation. The simplest way to apply kernel methods to multi-relational data is to first transform the data into a single table. This process is called propositionalization [25]. The first application of support vector machines to propositionalized data is described in [26]. We will not consider such approaches in more detail in this paper; we will rather concentrate on kernels defined directly on structured data.
The most prominent kernel for representation spaces that are not mere attribute-value tuples, the convolution kernel, is introduced first. Its key idea is to define a kernel on a composite object by means of kernels on the parts of the object. This idea is also reflected in many of the kernel functions developed later and described thereafter in this section. Several kernel functions have been defined on sequences of discrete symbols. The idea is always to extract all possible subsequences (of some kind) and to define the kernel function based on the occurrence and similarity of these subsequences. This idea can be generalized to extracting subtrees of trees and defining a kernel function based on the occurrence and similarity of subtrees in two trees. A rather different approach is that of representing the instances of the learning task as terms in a typed higher-order logic. These terms are powerful enough to allow for a close modeling of the semantics of different objects by means of the syntax of the representation. The last kernel described in this section is a kernel defined on instances represented by graphs.

4.1 Convolutions
The best known kernel for representation spaces that are not mere attribute-value tuples is the convolution kernel proposed by Haussler [15]. The basic idea of convolution kernels is that the semantics of composite objects can often be captured by a relation R between the object and its parts. The kernel on the object is then made up from kernels defined on different parts.
Let x, x' ∈ X be the objects and let \vec{x}, \vec{x}' ∈ X_1 × ... × X_D be tuples of parts of these objects. Given the relation R ⊆ (X_1 × ... × X_D) × X we can define the decomposition R^{-1} as R^{-1}(x) = {\vec{x} : R(\vec{x}, x)}. Then the convolution kernel is defined as

k(x, x') = \sum_{\vec{x} \in R^{-1}(x)} \; \sum_{\vec{x}' \in R^{-1}(x')} \; \prod_{d=1}^{D} k_d(x_d, x'_d)

The term 'convolution kernel' refers to a class of kernels that can be formulated in the above way. The advantage of convolution kernels is that they are very general and can be applied to many different problems. However, because of that generality, they require a significant amount of work to adapt them to a specific problem, which makes choosing R in 'real-world' applications a non-trivial task.
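A convolution kernel is easy to sketch once a decomposition relation R and the part kernels are fixed. The decomposition below (splitting a string into a head/tail pair at every position) and the matching part kernels are toy choices, purely to make the double sum and product concrete:

```python
from math import prod

def convolution_kernel(x, y, decompose, part_kernels):
    # decompose(x) plays the role of R^{-1}(x): the set of tuples of parts of x.
    # The kernel sums, over all pairs of decompositions, the product of the
    # kernels on corresponding parts.
    return sum(prod(k(p, q) for k, p, q in zip(part_kernels, px, py))
               for px in decompose(x) for py in decompose(y))

# toy decomposition of a string into (head, tail) pairs, one per split point
decompose = lambda s: [(s[:i], s[i:]) for i in range(1, len(s))]
match = lambda a, b: 1.0 if a == b else 0.0
print(convolution_kernel("abc", "abc", decompose, [match, match]))  # 2.0
print(convolution_kernel("abc", "abd", decompose, [match, match]))  # 0.0
```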

4.2 Strings
While the traditional kernel function used for text classification is simply the scalar product of two texts in their 'bag-of-words' representation [19], this kernel function does not take the structure of the text or words into account but simply the number of times each word occurs. More sophisticated approaches try to define a kernel function on the sequence of characters. Similar approaches define kernel functions on other sequences of symbols, e.g., on the sequence of symbols each corresponding to one amino acid and together describing a protein.
The first kernel function defined on strings can be found in [46; 31] and is also described in [5; 37]. The idea behind this kernel is to base the similarity of two strings on the number of common subsequences. These subsequences need not be contiguous in the strings, but the more gaps in the occurrence of the subsequence, the less weight is given to it in the kernel function. This can best be illustrated by an example.
Consider the two strings 'cat' and 'cart'. The common subsequences are 'c', 'a', 't', 'ca', 'at', 'ct', 'cat'. As mentioned above, it is useful to penalize gaps in the occurrence of the subsequence. This can be done using the total length of a subsequence in the two strings. We now list the common subsequences again and give the total length of their occurrence in 'cat' and 'cart' as well: 'c':1/1, 'a':1/1, 't':1/1, 'ca':2/2, 'at':2/3, 'ct':3/4, 'cat':3/4. Usually, an exponential decay is used. With a decay factor λ the penalty associated with each subsequence is 'c':(λ^1 λ^1), 'a':(λ^1 λ^1), 't':(λ^1 λ^1), 'ca':(λ^2 λ^2), 'at':(λ^2 λ^3), 'ct':(λ^3 λ^4), 'cat':(λ^3 λ^4). The kernel function is then simply the sum over these penalties, i.e., k('cat', 'cart') = 2λ^7 + λ^5 + λ^4 + 3λ^2.
Let Σ be a finite alphabet, Σ^n the set of strings of length n from that alphabet, and Σ^* the set of all strings from that alphabet. Let |s| denote the length of the string s ∈ Σ^* and let the symbols in s be indexed such that s = s_1 s_2 ... s_{|s|}.

For a set of indices i let s[i] be the subsequence of s induced by this set of indices, and let l(i) denote the total length of s[i] in s, i.e., the biggest index in i minus the smallest index in i plus one. We are now able to define the feature transformation φ underlying the string kernel. For some string u ∈ Σ^n the value of the feature φ_u(s) is defined as:

\phi_u(s) = \sum_{i : u = s[i]} \lambda^{l(i)}

The kernel between two strings s, t ∈ Σ^* is then simply the scalar product of φ(s) and φ(t) and can be written as:

k(s, t) = \sum_{u \in \Sigma^n} \phi_u(s) \phi_u(t) = \sum_{u \in \Sigma^n} \sum_{i : u = s[i]} \sum_{j : u = t[j]} \lambda^{l(i) + l(j)}

While this computation appears very expensive, with a recursive formulation it can be reduced to O(n|s||t|) [31].
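The 'cat'/'cart' computation above can be reproduced by a brute-force implementation that enumerates all index sets explicitly. This is exponential, unlike the O(n|s||t|) recursion of [31], and is meant only to make the definition concrete; the function names and the choice λ = 0.5 are illustrative:

```python
from itertools import combinations

def subsequence_kernel(s, t, n, lam):
    # brute-force gappy subsequence kernel:
    # k(s, t) = sum over u in Sigma^n of phi_u(s) phi_u(t), with
    # phi_u(s) = sum over index sets i such that s[i] = u of lam^{l(i)}
    def features(x):
        phi = {}
        for idx in combinations(range(len(x)), n):
            u = ''.join(x[i] for i in idx)
            span = idx[-1] - idx[0] + 1  # l(i): total length of the occurrence
            phi[u] = phi.get(u, 0.0) + lam ** span
        return phi
    ps, pt = features(s), features(t)
    return sum(ps[u] * pt.get(u, 0.0) for u in ps)

lam = 0.5
# summing over subsequence lengths 1..3 reproduces the 'cat'/'cart' example
total = sum(subsequence_kernel('cat', 'cart', n, lam) for n in (1, 2, 3))
assert abs(total - (3 * lam**2 + lam**4 + lam**5 + 2 * lam**7)) < 1e-12
print(total)
```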

An alternative to the above kernel has been used in [33] and [27], where only contiguous substrings of a given string are considered.


A string is then represented by the number of times each unique substring of (up to) n symbols occurs in the sequence. This representation of strings by their contiguous substrings is known as the spectrum of a string or as its n-gram representation. The kernel function is simply the scalar product in this representation. It is shown in [27] that this kernel can be computed in time linear in the length of the strings and the length of the considered substrings. In [33] not all possible n-grams are used, but a simple statistical test is employed as a feature subset selection heuristic. This kernel has been applied to protein [27] and spoken text [33] classification. Spoken text is represented by a sequence of phonemes, syllables, or words.
Many empirical results comparing the above kernel functions can be found in [30]. As shown in [36], these kernels can be seen as a special case of the Fisher kernel (presented in section 3.1). This perspective leads to a generalization of the above kernel that is able to deal with variable length substrings. In [44] and [28] string kernels similar to the n-gram kernel are considered and it is shown how these can be computed efficiently by using suffix and mismatch trees, respectively. The main conceptual difference to the n-gram kernel is that a given number of mismatches is allowed when comparing the n-grams to the substrings.
Another kernel on strings can be found in [47]. The focus of that paper is on the recognition of translation initiation sites in DNA or mRNA sequences. This problem is quite different from the applications considered above. The main difference is that rather than classifying whole sequences, in this task one codon (three consecutive symbols) from a sequence has to be identified as the translation initiation site. Each sequence can have arbitrarily many candidate solutions, of which one has to be chosen. However, in [47] and earlier work this problem is converted into a traditional classification problem on sequences of equal length.
Fixed length windows of symbols centered at each candidate solution are extracted from each sequence. The class of one such window corresponds to the candidate solution being a true translation initiation site or not. One valid kernel on these sequences is simply the number of symbols that coincide in the two sequences. Other kernels can, for example, be defined as a polynomial of this kernel.
Better classification accuracy is, however, reported in [47] for a kernel that puts more emphasis on local correlations. Let n be the length of each window and Σ be the set of possible symbols; then each window is an element of Σ^n. For x, x' ∈ Σ^n let x_i, x'_i denote the i-th element of each sequence. Using the matching kernel k_δ on Σ (k_δ : Σ × Σ → R is defined as k_δ(x_i, x'_i) = 1 if x_i = x'_i and k_δ(x_i, x'_i) = 0 if x_i ≠ x'_i), first a polynomial kernel on a small sub-window of length 2l + 1 with weights w_j ∈ R and power d_1 is defined:

k_i(x, x') = \left( \sum_{j=-l}^{l} w_j \, k_\delta(x_{i+j}, x'_{i+j}) \right)^{d_1}

Then the kernel on the full window is simply the polynomial of power d_2 of the kernels on the sub-windows:

k(x, x') = \left( \sum_{i=1}^{n} k_i(x, x') \right)^{d_2}

This kernel is called the 'locality-improved' kernel. An empirical comparison to other general purpose machine learning algorithms shows competitive results for l = 0 and better results for larger l. These results were obtained on the above mentioned translation initiation site recognition task.
Even better results can be achieved by replacing the symbol at some position in the above definition with the conditional probability of that symbol given the previous symbol. Let P_{i,TIS}(x_i|x_{i-1}) be the probability of symbol x_i at position i given symbol x_{i-1} at position i - 1, estimated over all true translation initiation sites. Furthermore, let P_{i,ALL}(x_i|x_{i-1}) be the probability of symbol x_i at position i given symbol x_{i-1} at position i - 1, estimated over all candidate sites. Then we define

s_i(x) = \log P_{i,TIS}(x_i|x_{i-1}) - \log P_{i,ALL}(x_i|x_{i-1})

and replace x_{i+j} in the locality-improved kernel by s_{i+j}(x) and the matching kernel by the product. A support vector machine using this kernel function outperforms all other approaches applied in [47].
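A minimal sketch of the locality-improved kernel with the matching kernel on symbols (the boundary handling and parameter defaults are illustrative choices; [47] additionally tunes per-position weights for the task):

```python
def matching(a, b):
    return 1.0 if a == b else 0.0

def locality_improved_kernel(x, y, l=2, d1=2, d2=3, weights=None):
    # x, y: equal-length symbol windows centred on candidate sites
    assert len(x) == len(y)
    if weights is None:
        weights = [1.0] * (2 * l + 1)
    total = 0.0
    for i in range(len(x)):
        # polynomial kernel of power d1 on the sub-window of length 2l + 1 around i
        # (positions falling outside the window are simply skipped here)
        win = sum(weights[j + l] * matching(x[i + j], y[i + j])
                  for j in range(-l, l + 1) if 0 <= i + j < len(x))
        total += win ** d1
    # polynomial of power d2 of the sub-window kernels
    return total ** d2

print(locality_improved_kernel("ATGCCA", "ATGTCA"))
```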

4.3 Trees
A kernel function that can be applied in many natural language processing tasks is described in [4]. The instances of the learning task are considered to be labeled ordered directed trees. The key idea to capture structural information about the trees in the kernel function is to consider all subtrees occurring in a parse tree. Here, a subtree is defined as a connected subgraph of a tree such that either all children or no child of a vertex is in the subgraph. The children of a vertex are the vertices that can be reached from the vertex by traversing one directed edge. The kernel function is the inner product in the space which describes the number of occurrences of all possible subtrees.
Consider some enumeration of all possible subtrees and let h_i(T) be the number of times the i-th subtree occurs in tree T. For two trees T_1, T_2 the kernel is then defined as

k(T_1, T_2) = \sum_i h_i(T_1) h_i(T_2)

Furthermore, for the sets of vertices V_1 and V_2 of the trees T_1 and T_2, let S(v_1, v_2) with v_1 ∈ V_1, v_2 ∈ V_2 be the number of subtrees rooted at vertices v_1 and v_2 that are isomorphic. Then the tree kernel can be computed as

k(T_1, T_2) = \sum_i h_i(T_1) h_i(T_2) = \sum_{v_1 \in V_1, v_2 \in V_2} S(v_1, v_2)

Let label(v) be a function that returns the label of vertex v, let |b^+(v)| denote the number of children of vertex v, and let b^+(v, j) be the j-th child of vertex v (only ordered trees are considered). S(v_1, v_2) can efficiently be calculated as follows: S(v_1, v_2) = 0 if label(v_1) ≠ label(v_2). S(v_1, v_2) = 1 if label(v_1) = label(v_2) and |b^+(v_1)| = |b^+(v_2)| = 0. Otherwise²,

S(v_1, v_2) = \prod_{j=1}^{|b^+(v_1)|} \left( 1 + S(b^+(v_1, j), b^+(v_2, j)) \right)

This recursive computation has time complexity O(|V_1| |V_2|).
²Note that in this case |b^+(v_1)| = |b^+(v_2)|, as the number of children of a vertex is actually determined by its label. This is due to the nature of the natural language processing applications that are considered.


Experiments investigated how much the application of a kernelized perceptron algorithm to trees generated by a probabilistic context free grammar outperforms the use of the probabilistic context free grammar alone. The empirical results achieved in [4] are promising. The kernel function used in these experiments is actually a weighted variant of the kernel function presented above.
A generalization of this kernel that also takes into account other substructures of the trees is described in [22]. A substructure of a tree is defined as a tree such that there is a descendants-preserving mapping from vertices in the substructure to vertices in the tree³. Another generalization considered in that paper is that of allowing labels to partially match. Promising results have been achieved with this kernel function in HTML document classification tasks.
Recently, [44] proposed the application of string kernels to trees by representing each tree by the sequence of labels generated by a depth-first traversal of the trees, written in preorder notation. To ensure that trees differing only in the order of their children are represented in the same way, the children of each vertex are ordered according to the lexical order of their string representation.
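The recursion for S(v_1, v_2) is easy to implement directly. In the sketch below a tree is a (label, children) pair; following footnote 2, equal labels are assumed to imply an equal number of children, and a defensive check returns 0 otherwise. The example trees are illustrative:

```python
def subtree_kernel(t1, t2):
    # a tree is a (label, [children]) pair; k(T1, T2) = sum over vertex pairs of S(v1, v2)
    def vertices(t):
        yield t
        for child in t[1]:
            yield from vertices(child)

    def S(v1, v2):
        label1, kids1 = v1
        label2, kids2 = v2
        if label1 != label2 or len(kids1) != len(kids2):  # defensive check
            return 0
        if not kids1:
            return 1
        result = 1
        for c1, c2 in zip(kids1, kids2):
            result *= 1 + S(c1, c2)
        return result

    return sum(S(v1, v2) for v1 in vertices(t1) for v2 in vertices(t2))

t1 = ("S", [("NP", [("D", []), ("N", [])]), ("V", [])])
t2 = ("S", [("NP", [("D", []), ("N", [])]), ("ADV", [])])
print(subtree_kernel(t1, t1), subtree_kernel(t1, t2))  # 17 11
```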

4.4 Basic Terms
In [14] a framework has been proposed that allows for the application of kernel methods to different kinds of structured data. This approach is based on the idea of having a powerful representation that allows for modeling the semantics of an object by means of the syntax of the representation. The underlying principle is that of representing individuals as (closed) terms in a typed higher-order logic [29]. The typed syntax is important for pruning search spaces and for modeling the semantics of the data as closely as possible in a human- and machine-readable form. The individuals-as-terms representation is a natural generalization of the attribute-value representation and collects all information about an individual in a single term.
The key idea is to have a fixed type structure. This type structure expresses the semantics and composition of individuals from their parts. The type structure is made up of function types, product types, and type constructors. Function types are used to represent types corresponding to sets, multisets, and so on. Product types are used to represent types corresponding to fixed size tuples. Type constructors are used to represent types corresponding to arbitrary size structured objects such as lists, trees, and so on. The set of type constructors also contains types corresponding to symbols and numbers.
To define, for example, a type corresponding to a subset of {A, B, C, D}, one first defines a type constructor of arity zero corresponding to elements of this set. Then the type corresponding to a set of these elements is a function type where the function is from the set of elements to the set of Boolean values. The type corresponding to a multiset of these elements is a function type where the function is from the set of elements to the set of natural numbers. It is important to note that the elements of these sets, lists, etc. are not restricted to be mere symbols but can again be structured objects. Thus one can, for example, define a type corresponding to a set of lists, or a list of sets.
Each type defines a set of terms that represent instances of that type; these are called the basic terms.

³A descendant of a vertex v is any vertex that occurs in the subtree rooted at v.

The terms of the logic are the terms of the typed λ-calculus, which are formed in the usual way by abstraction, tupling, and application. Abstraction corresponds to building instances of a function type, tupling to building instances of a product type, and application to building instances of a type constructor.
To this end, the biggest difference to terms of a first-order logic is the presence of abstractions, which allow modeling sets, multisets, and so on. For example, the basic terms s, t representing the set {1, 2} and the multiset with 42 occurrences of A and 21 occurrences of B (and nothing else) are, respectively:

s = λx. if x = 1 then ⊤ else if x = 2 then ⊤ else ⊥
t = λx. if x = A then 42 else if x = B then 21 else 0

Now, we need some additional notation. For a basic abstraction r, V(r u) denotes the value of r when applied to u, i.e., V(s 2) = ⊤ and V(t C) = 0. The default term is the value of the abstraction that has no condition, i.e., the default term of s is ⊥ and the default term of t is 0. The support of an abstraction is the set of terms u for which V(r u) differs from the default term, i.e., supp(s) = {1, 2} and supp(t) = {A, B}. More details of the knowledge representation can be found in [29]. The basic term kernel [14] is then defined inductively on the structure of terms.
If s, t are basic terms formed by application, i.e., instances of a type constructor, then s is C s_1 ... s_n and t is D t_1 ... t_m, where C, D are data constructors and s_i, t_j are basic terms. The kernel is then defined as

k(s, t) =
\begin{cases}
\kappa_T(C, D) & \text{if } C \neq D \\
\kappa_T(C, C) + \sum_{i=1}^{n} k(s_i, t_i) & \text{otherwise}
\end{cases}

If s, t are basic terms formed by abstraction, i.e., instances of a function type, then the kernel is defined as

k(s, t) = \sum_{u \in \mathrm{supp}(s)} \sum_{v \in \mathrm{supp}(t)} k(V(s\,u), V(t\,v)) \cdot k(u, v)

If s, t are basic terms formed by tupling, i.e., instances of a product type, then s is (s_1, ..., s_n) and t is (t_1, ..., t_n), where s_i, t_i are basic terms. The kernel is then defined as

k(s, t) = \sum_{i=1}^{n} k(s_i, t_i)
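A rough sketch of this inductive definition, covering only abstractions (modeled as Python dicts from support to value), tuples, and atomic constructors with the matching kernel as κ_T; constructor applications with arguments and the modifiers of [14] are omitted:

```python
def constructor_kernel(a, b):
    # kappa_T on data constructors / atomic values: 1 if equal, 0 otherwise
    return 1.0 if a == b else 0.0

def basic_term_kernel(s, t):
    # dicts model abstractions (support -> value), tuples model product types,
    # everything else is treated as an atomic constructor of arity zero
    if isinstance(s, dict) and isinstance(t, dict):
        return sum(basic_term_kernel(s[u], t[v]) * basic_term_kernel(u, v)
                   for u in s for v in t)
    if isinstance(s, tuple) and isinstance(t, tuple):
        return sum(basic_term_kernel(si, ti) for si, ti in zip(s, t))
    return constructor_kernel(s, t)

# the set {1, 2} and the multiset with 42 A's and 21 B's from the text,
# represented by their supports
s = {1: True, 2: True}
t = {'A': 42, 'B': 21}
print(basic_term_kernel(s, s))  # 2.0: only matching support elements contribute
print(basic_term_kernel(s, t))  # 0.0: the supports are disjoint
```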

In [14] additional 'modifiers' are described that allow for the modification of the default kernels in order to better reflect the semantics of the domain. We will now describe some applications along with the type definitions.
In drug activity prediction it is common to represent the shape of a molecule by measuring the distance from the center of the molecule to the surface in several directions. Thus a shape can be described by a tuple of real numbers. As, however, the shape of a molecule changes as its energy state changes, each molecule can only be described by a set of shapes (conformations). A molecule is active if one of its conformations satisfies some conditions; this is known as a 'multi-instance' concept. The task of classifying a molecule as active or inactive given the set of its conformations has been introduced in [6].


In [12] each molecule is represented by a set of conformations (i.e., an abstraction mapping a conformation to 'true' if it is a member of that set and to 'false' otherwise), and each conformation is represented by a tuple of real numbers. It can be shown that the corresponding basic term kernel is correct with respect to support vector machines and multi-instance concepts. In an empirical evaluation, support vector machines proved competitive with the best results achieved in the literature.
For spatial clustering it is important to gather data points in a cluster that are not only spatially close but also demographically similar. Applying a simple clustering algorithm to vectors containing both the spatial and the demographic information fails to find spatially compact clusters. Modeling the spatial and demographic information as two different types and defining the type of the individuals as a function type mapping the spatial information to the corresponding demographic information leads to a kernel function that can be used by a simple clustering algorithm to find spatially compact clusters [14].
To elucidate the structure of molecules, spectroscopic methods such as 13C NMR are frequently used. A spectrum contains information about the resonance frequency of each (chemically different) carbon atom in the molecule. Additional information can be obtained describing the number of protons directly connected with each carbon atom (the multiplicity). The task of predicting the skeleton structure of molecules from their NMR spectrum has been introduced in [9]. A spectrum is modeled as an abstraction mapping each resonance frequency to the multiplicity of the corresponding carbon atom and mapping every other frequency to zero. A support vector machine using the corresponding kernel function and a nearest neighbor algorithm using the corresponding metric have been shown to outperform all other algorithms applied in the literature to this problem.

4.5 Graphs
Labeled graphs are widely used in computer science to model data. Some of the work described above can be seen as kernels on restricted sets of graphs, e.g., on strings or on trees. In this section we consider recent work [13] on labeled graphs with arbitrary structure. Such kernel functions can, for example, be useful in learning tasks on molecules.
A graph G is described by a finite set of vertices V and a finite set of edges E. For directed graphs, the set of edges is a subset of the Cartesian product of the set of vertices with itself (E ⊆ V × V) such that (v_i, v_j) ∈ E if and only if there is an edge from v_i to v_j in graph G. For labeled graphs there is additionally a set of labels along with a function assigning a label to each edge and/or vertex.
The definition of a complete graph kernel is slightly different from the definition of complete kernels given above. In general it is not desired that learning algorithms distinguish between isomorphic graphs, i.e., graphs that only differ in the enumeration of their vertices. Complete graph kernels are those positive definite functions on graphs for which k(G, ·) = k(G', ·) if and only if G, G' are isomorphic.
Using a complete graph kernel and computing k(G, G) - 2k(G, G') + k(G', G') one could decide whether G, G' are isomorphic [13]. As deciding graph isomorphism is suspected to be computationally hard, one can conclude that no efficiently computable complete graph kernel exists. It can also be shown that some complete graph kernels are NP-hard to compute.
Consider a graph kernel that has one feature Φ_H for each possible graph H, each feature Φ_H(G) measuring how many subgraphs of G have the same structure as graph H.

Using the inner product in this feature space, graphs satisfying certain properties can be identified. In particular, one could decide whether a graph has a Hamiltonian path [13], i.e., a sequence of adjacent vertices that contains every vertex exactly once. Now this problem is known to be NP-hard, i.e., it is strongly believed that it cannot be solved in polynomial time.
The two results given above motivate the search for alternative graph kernels which are less expressive and therefore less expensive to compute. Conceptually, the graph kernels presented in [10; 13; 21; 23] are based on a measure of the walks in two graphs that have some or all labels in common. In [10] walks with equal initial and terminal label are counted, in [21; 23] the probability of random walks with equal label sequences is computed, and in [13] walks with equal label sequences, possibly containing gaps, are counted.
We consider here the work presented in [13]. In this work, efficient computation of these (possibly infinite) walks is made possible by using the direct product graph and computing the limit of matrix power series involving its adjacency matrix.
The two graphs generating the product graph are called the factor graphs. The vertex set of the direct product of two graphs is a subset of the Cartesian product of the vertex sets of the factor graphs. The direct product graph has a vertex if and only if the labels of the corresponding vertices in the factor graphs are the same. There is an edge between two vertices in the product graph if there is an edge between the corresponding vertices in both factor graphs and both edges have the same label. To build its adjacency matrix, assume some arbitrary enumeration {v_i}_i of the vertices. Each component of the adjacency matrix E_× is defined by [E_×]_{ij} = 1 ⇔ (v_i, v_j) ∈ E_× and [E_×]_{ij} = 0 ⇔ (v_i, v_j) ∉ E_×, where E_× denotes the edge set of the direct product graph. With a sequence of weights λ_0, λ_1, ... (λ_i ∈ R; λ_i ≥ 0 for all i ∈ N) the direct product kernel is then defined as

k_\times(G_1, G_2) = \sum_{i,j} \left[ \sum_{n=0}^{\infty} \lambda_n E_\times^n \right]_{ij}

if the limit exists. Using exponential series (λ_n = β^n/n!) or geometric series (λ_n = γ^n), this kernel can be computed in cubic time. Extensions suggested in [13] include a kernel for counting label sequences with gaps, and one for handling transition graphs, i.e., graphs with a probability distribution over all edges leaving the same vertex.
One interesting application of such graph kernels is an experiment in a relational reinforcement learning setting, described in [11]. In that paper Gaussian processes were applied with graph kernels as the covariance function. Experiments were performed in blocks worlds of up to ten blocks with three different goals. In this setting Gaussian processes with graph kernels proved competitive or superior to all previous implementations of relational reinforcement learning algorithms, although no sophisticated instance selection strategy was used.
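A compact sketch of the direct product kernel with a geometric weight series, for which the matrix power series collapses to a matrix inverse. Edge labels are ignored here and the example graphs are illustrative; γ must be smaller than the reciprocal of the spectral radius of E_× for the series to converge:

```python
import numpy as np

def direct_product_kernel(A1, labels1, A2, labels2, gamma=0.1):
    # vertices of the direct product graph: pairs of vertices with equal labels
    pairs = [(i, j) for i in range(len(labels1)) for j in range(len(labels2))
             if labels1[i] == labels2[j]]
    n = len(pairs)
    Ex = np.zeros((n, n))
    for a, (i, j) in enumerate(pairs):
        for b, (k, l) in enumerate(pairs):
            # edge in the product graph iff there is an edge in both factor graphs
            Ex[a, b] = A1[i, k] * A2[j, l]
    # geometric series: sum_n gamma^n Ex^n = (I - gamma Ex)^{-1},
    # provided gamma < 1 / (spectral radius of Ex)
    return float(np.linalg.inv(np.eye(n) - gamma * Ex).sum())

A1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)  # labeled triangle
A2 = np.array([[0, 1], [1, 0]], dtype=float)                    # a single labeled edge
print(direct_product_kernel(A1, ['c', 'c', 'o'], A2, ['c', 'o']))
```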

5. CONCLUSIONS
It has often been argued that relational data mining and inductive logic programming approaches should have successful propositional learning algorithms as a special case.


Support vector machines are one of the most successful recent developments within the machine learning area. Thus developing algorithms that can be applied to structured data and have support vector machines as a special case is a promising research direction. This can be achieved by defining positive definite kernels on structured data.
Such positive definite kernels allow algorithms to be applied to structured data that solve learning problems so far not considered by the inductive logic programming community. One example is kernel principal component analysis, which finds optimal embeddings of structured data into low dimensional spaces. Such an embedding can, for example, be used to visualize the relationships among instances, or to find a good attribute-value representation of the data.
Support vector machines have many computational and learning theoretical properties that make them a very interesting and popular learning algorithm. Many of these properties follow from the kernel function being positive definite. The problem of finding a hyperplane which is maximally distant from points on either side of the hyperplane can be formulated as a quadratic program. Such problems are convex if the matrix they operate on is positive definite. A convex problem has only one local optimum, which is thus the global optimum. This optimum can be found efficiently by well known optimization algorithms.
In this paper we described several approaches to defining positive definite kernel functions on various kinds of structured data. On the one hand, kernel functions that are applicable to similar kinds of data have been compared and their respective application areas briefly described. On the other hand, conceptual differences between approaches have been clarified and summarized.
We believe that investigating kernels on structured data is an important and promising research area. While several kernels have been defined so far, this research area is still very young and there is space for more work.

6. ACKNOWLEDGMENTS
Research supported in part by the EU Framework V project (IST-1999-11495) Data Mining and Decision Support for Business Competitiveness: Solomon Virtual Enterprise and the DFG project (WR 40/2-1) Hybride Methoden und Systemarchitekturen für heterogene Informationsräume. The author thanks Peter Flach, Tamas Horvath, John Lloyd, and Stefan Wrobel for valuable discussions and comments.

7. REFERENCES

[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 1950.

[2] K. Bennett and C. Campbell. Support vector machines: Hype or hallelujah? SIGKDD Explorations, 2(2), 2000. http://www.acm.org/sigs/sigkdd/explorations/issue2-2/bennett.pdf.

[3] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152. ACM Press, July 1992.

[4] M. Collins and N. Duffy. Convolution kernels for natural language. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14. MIT Press, 2002.

[5] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines (and Other Kernel-Based Learning Methods). Cambridge University Press, 2000.

[6] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31-71, 1997.

[7] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.

[8] S. Dzeroski and N. Lavrac, editors. Relational Data Mining. Springer-Verlag, 2001.

[9] S. Dzeroski, S. Schulze-Kremer, K. Heidtke, K. Siems, D. Wettschereck, and H. Blockeel. Diterpene structure elucidation from 13C NMR spectra with inductive logic programming. Applied Artificial Intelligence, 12(5):363-383, July-Aug. 1998. Special Issue on First-Order Knowledge Discovery in Databases.

[10] T. Gartner. Exponential and geometric kernels for graphs. In NIPS Workshop on Unreal Data: Principles of Modeling Nonvectorial Data, 2002.

[11] T. Gartner, K. Driessens, and J. Ramon. Graph kernels and Gaussian processes for relational reinforcement learning. In Proceedings of the 13th International Conference on Inductive Logic Programming, 2003.

[12] T. Gartner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In C. Sammut and A. Hoffmann, editors, Proceedings of the 19th International Conference on Machine Learning, pages 179-186. Morgan Kaufmann, June 2002.

[13] T. Gartner, P. A. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In Proceedings of the 16th Annual Conference on Computational Learning Theory and the 7th Kernel Workshop, 2003.

[14] T. Gartner, J. W. Lloyd, and P. A. Flach. Kernels for structured data. In Proceedings of the 12th International Conference on Inductive Logic Programming. Springer-Verlag, 2002.

[15] D. Haussler. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz, 1999.

[16] T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7(1,2), 2000.

[17] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems, volume 10, 1999.

[18] T. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Proceedings of the 1999 Conference on AI and Statistics, 1999.


[19] T. Joachims. Learning to Classify Text using Support Vector Machines. Kluwer Academic Publishers, 2002.

[20] R. Karchin, K. Karplus, and D. Haussler. Classifying G-protein coupled receptors with support vector machines. Bioinformatics, 18(1):147-159, 2002.

[21] H. Kashima and A. Inokuchi. Kernels for graph classification. In ICDM Workshop on Active Mining, 2002.

[22] H. Kashima and T. Koyanagi. Kernels for semi-structured data. In C. Sammut and A. Hoffmann, editors, Proceedings of the 19th International Conference on Machine Learning. Morgan Kaufmann, 2002.

[23] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning, 2003.

[24] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In C. Sammut and A. Hoffmann, editors, Proceedings of the 19th International Conference on Machine Learning, pages 315-322. Morgan Kaufmann, 2002.

[25] S. Kramer, N. Lavrac, and P. A. Flach. Propositionalization approaches to relational data mining. In Dzeroski and Lavrac [8], chapter 11.

[26] M.-A. Krogel and S. Wrobel. Transformation-based learning using multirelational aggregation. In C. Rouveirol and M. Sebag, editors, Proceedings of the 11th International Conference on Inductive Logic Programming. Springer-Verlag, 2001.

[27] C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, pages 564-575, 2002.

[28] C. Leslie, E. Eskin, J. Weston, and W. Noble. Mismatch string kernels for SVM protein classification. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15. MIT Press, 2003.

[29] J. W. Lloyd. Logic for Learning. Springer-Verlag, 2002.

[30] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2, 2002.

[31] H. Lodhi, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13. MIT Press, 2001.

[32] K.-R. Müller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2), 2001.

[33] G. Paass, E. Leopold, M. Larson, J. Kindermann, and S. Eickeler. SVM classification using sequences of phonemes and syllables. In T. Elomaa, H. Mannila, and H. Toivonen, editors, Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, pages 373-384. Springer-Verlag, 2002.

[34] P. Pavlidis, T. Furey, M. Liberto, D. Haussler, and W. Grundy. Promoter region-based classification of genes. In Proceedings of the Pacific Symposium on Biocomputing, pages 151-163, 2001.

[35] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-285, Feb. 1989.

[36] C. Saunders, J. Shawe-Taylor, and A. Vinokourov. String kernels, Fisher kernels and finite state automata. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15. MIT Press, 2003.

[37] B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.

[38] N. Smith and M. Gales. Speech recognition using SVMs. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14. MIT Press, 2002.

[39] K. Tsuda, M. Kawanabe, G. Ratsch, S. Sonnenburg, and K.-R. Müller. A new discriminative kernel from probabilistic models. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14. MIT Press, 2002.

[40] K. Tsuda, T. Kin, and K. Asai. Marginalized kernels for biological sequences. Bioinformatics, 2002.

[41] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.

[42] J.-P. Vert. A tree kernel to analyze phylogenetic profiles. Bioinformatics, 2002.

[43] J.-P. Vert and M. Kanehisa. Graph driven feature extraction from microarray data using diffusion kernels and kernel CCA. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15. MIT Press, 2003.

[44] S. Vishwanathan and A. Smola. Fast kernels for string and tree matching. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15. MIT Press, 2003.

[45] C. Watkins. Dynamic alignment kernels. Technical report, Department of Computer Science, Royal Holloway, University of London, 1999.

[46] C. Watkins. Kernels from matching operations. Technical report, Department of Computer Science, Royal Holloway, University of London, 1999.

[47] A. Zien, G. Ratsch, S. Mika, B. Scholkopf, T. Lengauer, and K.-R. Müller. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, 16(9):799-807, 2000.