string kernels

Upload: srirangam-ramaraj

Post on 03-Apr-2018

231 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/28/2019 String Kernels

    1/38

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 1

    String and Tree KernelsAlgorithms and Applications

    S.V.N. Vishy [email protected]

    National ICT of Australiaand

    Indian Insitutute of Science

    Bangalore, India

    Joint work with Alex Smola

  • 7/28/2019 String Kernels

    2/38

    Overview - I

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 2

    Motivation

    Exact Kernels on Strings

    Definition and Examples

    Suffix Trees

    DefinitionMatching Statistics

    Counting Substrings

    Weights and Kernels

    AnnotationWeighting Functions

    Linear Time Prediction

  • 7/28/2019 String Kernels

    3/38

    Overview - II

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 3

    Sliding Windows

    Position Dependent Weights

    Kernels on Trees

    Definition

    Sorting Trees

    Tree to String Conversion

    Coarsening Levels

    Inexact Kernels on Strings

    DefinitionMismatch Kernel

    Space Time Tradeoffs

    Extensions and Future Work

  • 7/28/2019 String Kernels

    4/38

    Motivation

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 4

    Kernel Methods

    Exciting theoretical boundsNumerous applications

    Limited to vectorial data

    Strings

    Bio-informatics

    Spam filtering

    Internet search engines

    String KernelsMust respect structure

    Fast and easy to compute

    Semantically meaningful

  • 7/28/2019 String Kernels

    5/38

    Notation

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 5

    Alphabet

    Set of characters denoted by AString

    Any x Ak for some k

    Sentinel

    Some $ / A used to terminate a string

    Prefix/Suffix/Substring

    Let x = uvw for some possibly empty u, v and w

    u is called the prefix and w the suffix of xv is called a substring of x

    We write v x

  • 7/28/2019 String Kernels

    6/38

    String Kernels - I

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 6

    Exact Matching Kernels

    k(x, x) :=

    sx,sxwss,s =

    sA

    nums(x)nums(x)ws.

    Count all matching substrings

    Flexible weighting schemeDifferent applications = different weights

    Noise in training data = does not work :-(

    Successful applications in bio-informatics (Vishwanathan

    and Smola, 2002) (Leslie et. al., 2002)

    Linear time algorithms using suffix trees

  • 7/28/2019 String Kernels

    7/38

    Exact String Kernels

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 7

    Bag of Characters

    Counts single characters (Joachims, 1999). Set ws = 0 forall |s| > 1

    Bag of Wordss is bounded by whitespace (Joachims, 1999)

    Limited Range CorrelationsSet ws = 0 for all |s| > n given a fixed n

    K-spectrum kernelAccount for matching substrings of length k (Leslie et al.,

    2002). Set ws = 0 for all |s| = kGeneral Case

    Quadratic time kernel computation (Haussler, 1998,Watkins, 1998), cubic time prediction

  • 7/28/2019 String Kernels

    8/38

    Suffix Trees

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 8

    Definition

    Compact tree built from all the suffixes of a word. Suffixtree of ababc denoted by S(ababc).

    /.-,()*+ab

    wwoooooooooooooo

    b cccc

    cccc c$

    **

    /.-,()*+ 22 abc$

    c$

    cccc

    cccc

    /.-,()*+abc$

    c$

    cccc

    cccc

    /.-,()*+

    /.-,()*+ /.-,()*+ /.-,()*+ /.-,()*+

    Node label := unique path from the root

    Suffix links are used to speed up parsing of strings

    Suppose we are at a node ax then suffix links help us tojump to node x

    Each internal node has a unique suffix link (McCreight,76)

  • 7/28/2019 String Kernels

    9/38

    Suffix Trees Contd. . .

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 9

    Properties

    Represents all the substrings of the given stringCan be constructed in linear time

    Offline algorithms - (Weiner 73) and (McCreight 76)

    Online algorithm - (Ukkonen 93)

    Can be stored using linear spaceEdges are encoded by indices of the substring

    Each leaf corresponds to a unique suffix

    Leaves on subtree give number of occurrences

    Each internal node has at least 2 distinct childrenAnnotation by performing DFS is easy

    LCA queries in constant time (linear pre-processing)

  • 7/28/2019 String Kernels

    10/38

    Matching Statistics

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 10

    Definition

    Given strings x, y with |x| = n and |y| = m, the matchingstatistics of x with respect to y are defined by v, c Nn,where

    vi is the length of the longest substring of y matching aprefix of x[i : n]

    vi := i + vi 1

    ci is a pointer to ceil(x[i : vi]) in S(y).

    ceil(s) is the last node on the path from the root to s

    This can be computed in linear time (Chang and Lawler,1994).

  • 7/28/2019 String Kernels

    11/38

    Example

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 11

    Matching statistic of abba with respect to S(ababc).

    String a b b avi 2 1 2 1

    ceil(ci) a b b b root

    /.-,()*+ab

    wwoooooooooooooo

    b cccc

    cccc c$

    **

    /.-,()*+ 22 abc$

    c$

    ccc

    cccc

    c/.-,()*+

    abc$

    c$

    ccc

    cccc

    c/.-,()*+

    /.-,()*+ /.-,()*+ /.-,()*+ /.-,()*+

  • 7/28/2019 String Kernels

    12/38

    Matching Substrings

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 12

    Prefixes

    w is a substring of x iff there is an i such that w is a prefixof x[i : n]. The number of occurrences of w in x can becalculated by finding all such i.

    SubstringsThe set of matching substrings of x and y is the set of allprefixes of x[i : vi].

    Next StepIf we have a substring w of x, prefixes of w may occur inx with higher frequency. We need an efficient computation

    scheme.

  • 7/28/2019 String Kernels

    13/38

    Key Trick - I

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 13

    Theorem

    Let x and y be strings and c and v be the matching statis-tics of x with respect to y. Assume that

    W(y, t) =

    sprefix(v)

    wus wu where u = ceil(t) and t = uv.

    can be computed in constant time for any t. Then k(x, y)can be computed in O(|x| + |y|) time as

    k(x, y) =

    |x|

    i=1

    val(x[i : vi])

    where val(s) indicates the contribution to the kernel due tostring s and all its prefixes.

    K T i k II

  • 7/28/2019 String Kernels

    14/38

    Key Trick - II

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 14

    Observation

    All substrings ending on the same edge of s(y) occur thesame number of times in string y

    For any matching substring t we can write

    val(t) := lvs(floor(t)) W(y, t) + val(ceil(t))

    For each node v we can pre-compute val(v) by a simpleDFS on the suffix tree

    We can compute in constant time

    val(x[i : vi]) = lvs(floor(x[i : vi])) W(y, x[i : vi]) + val(ci)

    C ti

  • 7/28/2019 String Kernels

    15/38

    Computing W(y, t)

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 15

    Length-Dependent Weights

    Assume that ws = w|s|, then

    W(y, t) =

    |t|j=| ceil(t)|

    wj w| ceil(t)| = |t| | ceil(t)|

    where j :=

    ji=1 wj, which can be pre-computed andstored for j = 1, 2, . . . max(|x|, |y|).

    K-spectrum KernelIt is easy to see that

    W(y, t) =

    1 if | ceil(t)| < k and |t| k0 if otherwise

    G i W i ht

  • 7/28/2019 String Kernels

    16/38

    Generic Weights

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 16

    TFIDF Weights

    Assume ws = (freq(s))(|s|)Strings on the same edge = same frequency

    W(y, t) = (freq(t))

    |t|

    i=| ceil(t)|+1

    (i)

    To take into account frequency in entire training set builda master suffix tree of all strings in the training set

    In case weights are completely arbitrary it suffices to

    annotate all the nodes of this master suffix tree (Vish-wanathan and Smola, 2002).

    Th E ti ti P bl

  • 7/28/2019 String Kernels

    17/38

    The Estimation Problem

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 17

    ProblemFor prediction we need to compute

    f(x) =

    i ik(xi, x).

    This depends on the number of SVs

    Web search engines and spam filtering

    Large number of SVs

    Real time prediction is criticalKey Observation

    We are repeatedly parsing the SV strings

    Pre-processing the SV set can speed up prediction

    IdeaWe can merge matching weights from all the SVs. All weneed is a master suffix tree!

    Master S ffi Tree

  • 7/28/2019 String Kernels

    18/38

    Master Suffix Tree

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 18

    A collection of strings X = {x1, x2 . . . xm}

    We define S(X) as a natural union of S(xi)Define X := x1$1x2$2 . . . xm$m

    Construct S(X)

    Prune away the extra labels on the leaves

    The $is induce a extra log penalty in construction time

    Aamir et. al. algorithm avoids this penalty

    It uses a clever modification of the McCreight algorithm

    We can construct the master suffix tree S(X) in O(

    i |xi|)time

    Estimation Algorithm

  • 7/28/2019 String Kernels

    19/38

    Estimation Algorithm

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 19

    A substring s in a Support Vector xi is counted (i.e. con-

    tributes a weight value) i times to the kernelWe merge the suffix trees of all the Support Vectors into amaster suffix tree

    In the master suffix tree we associate weight i with each

    leaf derived from Support Vector xiFor a node v, we define lvs(v) as the sum of weights asso-ciated with the subtree rooted at v

    The key theorem can now be applied unchanged to com-

    pute f(x)Our algorithm runs in time linear in the size of x and isindependent of the size of the Support Vector set!

    Sliding Windows

  • 7/28/2019 String Kernels

    20/38

    Sliding Windows

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 20

    Motivation

    Bioinformatics: Estimate on windows of a long stringProblem

    Long sequence x and a window width N are given

    Estimate the function value for windows of length N

    Observation

    Interestingmatches = cross a window boundary

    Algorithm

    Compute the matching statistics for xAccount for cross boundary matches (at most N)

    In practice very few matches cross the boundary

    Expected sub-linear time for sliding a window!

    A Picture Helps

  • 7/28/2019 String Kernels

    21/38

    A Picture Helps

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 21

    Positional Weights

  • 7/28/2019 String Kernels

    22/38

    Positional Weights

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 22

    Motivation

    Bioinformatics: weigh exons and introns differentlyDefinition

    k(x, x) :=

    sx,sx

    w(s,x)(s,x)s,s

    where w(s,x) and (s,x) are position dependent.Observation

    Each leaf of S(x) corresponds to a unique suffix of x

    Algorithm

    Assign a different weight to each leaf

    As before sum the weights on each subtree

    The original algorithm runs unchanged!

    Tree Kernels

  • 7/28/2019 String Kernels

    23/38

    Tree Kernels

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 23

    Subset TreesSet of connected nodes of a tree T

    Definition (Colins and Duffy, 2001)Denote by T, T trees and by t |= T a subset tree of T, then

    k(T, T) = t|=T,t|=T

    wtt,t.

    Our Definition (Vishwanathan and Smola, 2002)In case we count matching subtrees then t |= T denotesthat t is a subtree of T and we get

    k(T, T) =

    t|=T,t|=T

    wtt,t.

    Permutation Invariance

  • 7/28/2019 String Kernels

    24/38

    Permutation Invariance

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 24

    ProblemWe want permutation invariance of unordered trees.

    Example:The following two unordered trees are mirror images

    /.-,()*+

    ccc

    cccc

    c/.-,()*+

    ccc

    cccc

    c

    /.-,()*+ /.-,()*+

    ccc

    cccc

    c /.-,()*+

    ccc

    cccc

    c /.-,()*+

    /.-,()*+ /.-,()*+ /.-,()*+ /.-,()*+

    Solution

    Sort trees before computing kernel

    Maps equivalent trees to a single representative

    Sorting Trees - I

  • 7/28/2019 String Kernels

    25/38

    Sorting Trees - I

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 25

    Sorting Rules

    Assume existence of lexicographic order on labelsIntroduce symbols [, ] satisfy [< ], and that ], [==========================

  • 7/28/2019 String Kernels

    35/38

  • 7/28/2019 String Kernels

    36/38

    Results

  • 7/28/2019 String Kernels

    37/38

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 37

    Summary and Extensions

  • 7/28/2019 String Kernels

    38/38

    y

    S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 38

    Reduction from quadratic or cubic to linear prediction andkernel computation time

    Kernels on heaps, stacks, bags, etc. trivial

    Compact storage of SVs if redundancies abound in SVset. E.g. for anagram and analphabet we need onlyanalphabet and gram

    Ideas can also be extended to suffix arrays

    Approximate matching and wildcards

    Can we do better in case of approximate matches?

    Automata and dynamical systems

    Do expensive things with string kernel classifiers

    More information at h t t p : / / w w w . k e r n e l - m a c h i n e s . o r g

    http://www.kernel-machines.org/