Francesco Camastra - Kernel Methods for Clustering


    Kernel Methods for Clustering

    Francesco Camastra

    [email protected]

DISI, Università di Genova


Talk Outline

    Preliminaries (Unsupervised Learning, Kernel Methods)

    Kernel Methods for Clustering

    Experimental Results

    Conclusions and Future Work


    The Learning Problem

The learning problem can be described as finding a general rule (description) that explains data, given only a sample of limited size.


    Learning Algorithms

Learning algorithms can be grouped into three main families:

    Supervised algorithms

    Reinforcement Learning algorithms

    Unsupervised algorithms


    Supervised Algorithms

If the data is a sample of input-output patterns, a data description is a function that produces the output given the input.

The learning is called supervised because target values (e.g. classes, real values) are associated with the data.


    Unsupervised Algorithms

If the data is only a sample of objects without associated target values, the problem is known as unsupervised learning.

A data description can be:

a set of clusters, or a probability density function stating the probability of observing a certain object in the future;

a manifold that contains all data without information loss (manifold learning).


    Kernel Methods

Kernel Methods are algorithms that implicitly perform a nonlinear mapping of the input data to a high-dimensional Feature Space, by replacing the inner product with an appropriate Mercer Kernel.


    Mercer Kernel

We call the kernel G a Mercer kernel (or positive definite kernel) if and only if it is symmetric (i.e. G(x, y) = G(y, x) for all x, y ∈ X) and

$$\sum_{j,k=1}^{n} c_j c_k \, G(x_j, x_k) \ge 0$$

for all n ≥ 2, x1, ..., xn ∈ X and c1, ..., cn ∈ R.

Each Mercer kernel G(x, y) can be represented as:

$$G(x, y) = \Phi(x) \cdot \Phi(y), \qquad \Phi : X \to \mathcal{F}$$

F is called the Feature Space. If Φ is known, the mapping is explicit, otherwise it is implicit.


    Mercer Kernel Examples

Square Kernel: S(x, y) = (x · y)². The square kernel mapping is explicit.

If we consider x = (x1, x2) and y = (y1, y2) we have:

$$S(x, y) = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2$$

Therefore Φ is:

$$\Phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2), \qquad \Phi : \mathbb{R}^2 \to \mathbb{R}^3$$

(in the n-dimensional case x ∈ Rⁿ, Φ : Rⁿ → R^(n(n+1)/2))

Gaussian Kernel: G(x, y) = exp(−‖x − y‖² / σ²). The Gaussian kernel mapping is implicit.
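As an illustration (not part of the original slides), the following Python/NumPy sketch checks numerically that the explicit mapping of the square kernel reproduces the kernel values, and evaluates the Gaussian kernel, whose mapping stays implicit; the value σ = 1 is an arbitrary choice.

```python
import numpy as np

def square_kernel(x, y):
    # S(x, y) = (x . y)^2
    return float(np.dot(x, y)) ** 2

def phi_square(x):
    # Explicit mapping of the 2-D square kernel: Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2.0) * x1 * x2])

def gaussian_kernel(x, y, sigma=1.0):
    # G(x, y) = exp(-||x - y||^2 / sigma^2); its feature map is implicit
    return float(np.exp(-np.dot(x - y, x - y) / sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(square_kernel(x, y), np.dot(phi_square(x), phi_square(y)))  # same value twice
print(gaussian_kernel(x, y))
```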


    Distance in Feature Space

Given two vectors x and y, we remember that G(x, y) = Φ(x) · Φ(y). Therefore it is always possible to compute their distance in the Feature Space:

$$\|\Phi(x) - \Phi(y)\|^2 = (\Phi(x) - \Phi(y)) \cdot (\Phi(x) - \Phi(y)) = \Phi(x) \cdot \Phi(x) - 2\,\Phi(x) \cdot \Phi(y) + \Phi(y) \cdot \Phi(y) = G(x, x) - 2\,G(x, y) + G(y, y)$$
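A minimal Python/NumPy sketch of this identity (not from the slides): the feature-space distance requires only kernel evaluations, never Φ itself.

```python
import numpy as np

def feature_space_distance(x, y, kernel):
    # ||Phi(x) - Phi(y)||^2 = G(x, x) - 2 G(x, y) + G(y, y)
    d2 = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return np.sqrt(max(d2, 0.0))  # clip tiny negatives caused by round-off

gaussian = lambda x, y, s=1.0: float(np.exp(-np.dot(x - y, x - y) / s ** 2))
print(feature_space_distance(np.array([0.0, 1.0]), np.array([1.0, 3.0]), gaussian))
```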


    Clustering Methods

Following Jain's approach (Jain et al., 1999), clustering methods can be categorized into:

Hierarchical Clustering: hierarchical schemes sequentially build nested clusters with a graphical representation (dendrogram).

Partitioning Clustering: partitioning methods directly assign all the data points, according to some appropriate criteria (e.g. similarity and density), into different groups (clusters).

This research has focused on prototype-based clustering algorithms, the most popular class of Partitioning Clustering methods.


    Clustering: Some Definitions

Let D be a data set, whose cardinality is m, formed by vectors ξ ∈ Rⁿ. The Codebook is the set W = (w1, w2, ..., wK) where each element (codevector) w_c ∈ Rⁿ and K ≪ m.

The Voronoi set V_c of the codevector w_c is the set of all vectors in D for which w_c is the nearest codevector:

$$V_c = \{\xi \in D \mid c = \arg\min_j \|\xi - w_j\|\}$$

A codebook is optimal if it minimizes the quantization error J:

$$J = \frac{1}{2|D|} \sum_{c=1}^{K} \sum_{\xi \in V_c} \|\xi - w_c\|^2$$

where |D| is the cardinality of D.
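A short NumPy sketch (illustrative, not from the slides) of these definitions: Voronoi assignment of each point to its nearest codevector and the resulting quantization error J.

```python
import numpy as np

def voronoi_assignment(D, W):
    # Index of the nearest codevector w_c for every vector in D
    dists = np.linalg.norm(D[:, None, :] - W[None, :, :], axis=2)  # shape (m, K)
    return np.argmin(dists, axis=1)

def quantization_error(D, W):
    # J = 1 / (2|D|) * sum_c sum_{xi in Vc} ||xi - w_c||^2
    nearest = voronoi_assignment(D, W)
    return float(np.sum((D - W[nearest]) ** 2) / (2.0 * len(D)))
```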


    K-Means (Lloyd, 1957)

1. Initialize the codebook W = {w1, w2, ..., wK} with K vectors chosen from the data set D.

2. Compute for each codevector w_i ∈ W its Voronoi set V_i:

$$V_i = \{\xi \in D \mid i = \arg\min_j \|\xi - w_j\|\}$$

3. Move each codevector w_i to the mean of its Voronoi set:

$$w_i = \frac{1}{|V_i|} \sum_{\xi \in V_i} \xi$$

4. Go to step 2 if any codevector w_i changes, otherwise return the codebook.
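A compact NumPy sketch of the four steps above (a plain Lloyd iteration; the random initialization and the empty-cluster guard are my own choices, not prescribed by the slides).

```python
import numpy as np

def k_means(D, K, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize the codebook with K vectors chosen from the data set D
    W = D[rng.choice(len(D), size=K, replace=False)].copy()
    while True:
        # 2. Voronoi set of each codevector (index of the nearest codevector)
        nearest = np.argmin(np.linalg.norm(D[:, None, :] - W[None, :, :], axis=2), axis=1)
        # 3. Move each codevector to the mean of its Voronoi set
        W_new = np.array([D[nearest == i].mean(axis=0) if np.any(nearest == i) else W[i]
                          for i in range(K)])
        # 4. Stop when no codevector changes, otherwise iterate
        if np.allclose(W_new, W):
            return W_new
        W = W_new
```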


    Kernel Methods for Clustering

    Methods that kernelise the metric (Yu et al. 2002), i.e. the metric is

    computed by means of a Mercer Kernel in a Feature Space.

    Kernel K-Means (Girolami, 2002)

    Kernel methods based on support vector data description (Camastra and

    Verri, 2005)


    Kernelising the metric

Given x, y ∈ Rⁿ the metric d_G(x, y) in the Feature Space is:

$$d_G(x, y) = \|\Phi(x) - \Phi(y)\| = \left( G(x, x) - 2\,G(x, y) + G(y, y) \right)^{\frac{1}{2}}$$

Given a data set D = (x_i ∈ Rⁿ, i = 1, 2, ..., m), the goal is to get a codebook W = (w_i ∈ Rⁿ, i = 1, 2, ..., K) that minimizes the quantization error E(D):

$$E(D) = \frac{1}{2|D|} \sum_{c=1}^{K} \sum_{x_i \in V_c} \|x_i - w_c\|^2$$

where V_c is the Voronoi set of the codevector w_c.

Hence we can compute the metric in the Feature Space, i.e. we have:

$$E_G(D) = \frac{1}{2|D|} \sum_{c=1}^{K} \sum_{x_i \in V_c} \left( G(x_i, x_i) - 2\,G(x_i, w_c) + G(w_c, w_c) \right)$$


    Kernelising the metric (cont.)

A naive solution to minimize E_G(D) consists in computing ∂E_G(D)/∂w_c and using a steepest gradient descent algorithm. Hence some classical clustering algorithms can be kernelised. For instance, consider online K-Means. Its learning rule is

$$\Delta w_c = \eta\,(\xi - w_c) = -\eta\,\frac{\partial E(D)}{\partial w_c}$$

where ξ is the input vector and w_c is the winner codevector for ξ. Hence it can be rewritten as:

$$\Delta w_c = -\eta\,\frac{\partial E_G(D)}{\partial w_c}$$

In the case of G(x, y) = exp(−‖x − y‖² / σ²) the equation becomes

$$\Delta w_c = \eta\,\frac{(\xi - w_c)}{\sigma^2}\, \exp\!\left( -\frac{\|\xi - w_c\|^2}{\sigma^2} \right)$$
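The sketch below (NumPy, illustrative only) applies one such online update with the Gaussian kernel; η and σ are assumed hyperparameter values. Since G(x, x) is constant for the Gaussian kernel, the feature-space winner coincides with the nearest codevector in input space.

```python
import numpy as np

def online_kernelised_update(xi, W, eta=0.1, sigma=1.0):
    """One online step of the kernelised K-Means rule above (a sketch)."""
    d2 = np.sum((W - xi) ** 2, axis=1)
    c = int(np.argmin(d2))          # winner codevector for the input xi
    # Delta w_c = eta * (xi - w_c) / sigma^2 * exp(-||xi - w_c||^2 / sigma^2)
    W[c] += eta * (xi - W[c]) / sigma ** 2 * np.exp(-d2[c] / sigma ** 2)
    return W
```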


    One-Class SVM

One-Class SVM (1SVM) (Tax and Duin, 1999) (Schölkopf et al., 2001) searches for the hypersphere in the Feature Space F with centre a and minimal radius R containing most data. The problem can be expressed as:

$$\min_{R, a, \xi} \; R^2 + C \sum_i \xi_i$$

subject to ‖Φ(x_i) − a‖² ≤ R² + ξ_i and ξ_i ≥ 0, i = 1, ..., m,

where x1, ..., xm is the data set. To solve the problem the Lagrangian L is introduced:

$$L = R^2 - \sum_j \beta_j \left( R^2 + \xi_j - \|\Phi(x_j) - a\|^2 \right) - \sum_j \xi_j \mu_j + C \sum_j \xi_j$$

where β_j ≥ 0 and μ_j ≥ 0 are Lagrange multipliers and C is a constant.


    1SVM (cont.)

Setting to zero the derivatives of L w.r.t. R, a and ξ_j and substituting, we turn the Lagrangian into the Wolfe dual form W:

$$W = \sum_j \beta_j\, \Phi(x_j) \cdot \Phi(x_j) - \sum_i \sum_j \beta_i \beta_j\, \Phi(x_i) \cdot \Phi(x_j) = \sum_j \beta_j\, G(x_j, x_j) - \sum_i \sum_j \beta_i \beta_j\, G(x_i, x_j)$$

with Σ_j β_j = 1 and 0 ≤ β_j ≤ C.

β_j = 0: Φ(x_j) is inside the sphere.

β_j = C: Φ(x_j) is outside the sphere.

0 < β_j < C: Φ(x_j) is on the surface of the sphere. These points are called support vectors.


    1SVM (cont.)

The center a is:

$$a = \sum_j \beta_j\, \Phi(x_j)$$

Therefore the center position can be unknown. Nevertheless the distance R(x) of a point Φ(x) from the center a can always be computed:

$$R^2(x) = G(x, x) - 2 \sum_j \beta_j\, G(x_j, x) + \sum_i \sum_j \beta_i \beta_j\, G(x_i, x_j)$$

The Gaussian is a usual choice for the kernel G(·).
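As an illustration (not the authors' implementation), the sketch below solves the Wolfe dual of the previous slide with a generic constrained optimizer (SciPy's SLSQP) and then evaluates R²(x) from the resulting β; in practice dedicated SMO-style solvers are used, and the Gaussian kernel, σ and C are assumed choices.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_gram(X, Y, sigma=1.0):
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / sigma ** 2)

def fit_1svm(K, C=1.0):
    """Maximize W = sum_j beta_j K_jj - beta^T K beta, with sum_j beta_j = 1, 0 <= beta_j <= C."""
    m = K.shape[0]
    objective = lambda b: b @ K @ b - b @ np.diag(K)      # minimize -W
    constraint = {"type": "eq", "fun": lambda b: np.sum(b) - 1.0}
    res = minimize(objective, np.full(m, 1.0 / m), method="SLSQP",
                   bounds=[(0.0, C)] * m, constraints=[constraint])
    return res.x

def radius2(x, X, beta, sigma=1.0):
    # R^2(x) = G(x, x) - 2 sum_j beta_j G(x_j, x) + sum_ij beta_i beta_j G(x_i, x_j)
    K = gaussian_gram(X, X, sigma)
    k_x = gaussian_gram(X, x[None, :], sigma)[:, 0]
    return 1.0 - 2.0 * beta @ k_x + beta @ K @ beta       # G(x, x) = 1 for the Gaussian
```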


    Camastra-Verri Algorithm: Definitions

Given a data set D, we map the data into a Feature Space F. We consider K centers a_i ∈ F, i = 1, ..., K. We call the set A = (a_1, ..., a_K) the Feature Space Codebook.

We define for each center a_c its Voronoi set in the Feature Space:

$$FV_c = \{x_i \in D \mid c = \arg\min_j \|\Phi(x_i) - a_j\|\}$$


    Camastra-Verri Algorithm: Strategy

Our algorithm uses a K-Means-like strategy, i.e. it repeatedly moves the centers, computing a 1SVM for each center, until no center changes anymore.

To make the algorithm more robust with respect to outliers, the 1SVM is computed on FV_c(ρ) of each center a_c:

$$FV_c(\rho) = \{x_i \in FV_c \;\text{and}\; \|\Phi(x_i) - a_c\| < \rho\}$$

FV_c(ρ) can be seen as the Voronoi set in the Feature Space of the center a_c without outliers.

The parameter ρ can be set up using model selection techniques.


    The Algorithm

1. Project the data set D into a Feature Space F, by means of a nonlinear mapping Φ. Initialize the centers a_c ∈ F, c = 1, ..., K.

2. Compute FV_c(ρ) for each center a_c.

3. Apply the 1SVM to each FV_c(ρ) and assign to a_c the center yielded, i.e. a_c = 1SVM(FV_c(ρ)).

4. Go to step 2 if any a_c changes, otherwise go to step 5.

5. Return the Feature Space codebook.
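A compact sketch of the whole loop (Python/NumPy/SciPy, illustrative only, not the authors' reference code). Each feature-space center is represented by the support-vector expansion (points, β) returned by the 1SVM of the previous slides; the Gaussian kernel, σ, ρ, C and the fixed iteration budget are assumed choices, and the stopping test of step 4 is simplified.

```python
import numpy as np
from scipy.optimize import minimize

def gauss(X, Y, sigma):
    return np.exp(-np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=2) / sigma ** 2)

def fit_1svm(K, C):
    # Wolfe dual of the 1SVM: max sum_j b_j K_jj - b^T K b, sum_j b_j = 1, 0 <= b_j <= C
    m = K.shape[0]
    res = minimize(lambda b: b @ K @ b - b @ np.diag(K), np.full(m, 1.0 / m),
                   method="SLSQP", bounds=[(0.0, C)] * m,
                   constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}])
    return res.x

def camastra_verri(D, n_centers, sigma=1.0, rho=1.0, C=1.0, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: each center a_c starts as Phi(x) of a randomly chosen point
    centers = [(D[i:i + 1], np.array([1.0]))
               for i in rng.choice(len(D), size=n_centers, replace=False)]
    for _ in range(n_iter):
        # Step 2: squared distance ||Phi(x) - a_c||^2 of every point from every center
        dist2 = np.empty((len(D), n_centers))
        for c, (S, beta) in enumerate(centers):
            dist2[:, c] = (1.0 - 2.0 * gauss(D, S, sigma) @ beta
                           + beta @ gauss(S, S, sigma) @ beta)
        labels = np.argmin(dist2, axis=1)
        # Step 3: refit a 1SVM on each FV_c(rho), i.e. the Voronoi set without outliers
        for c in range(n_centers):
            members = D[(labels == c) & (np.sqrt(dist2[:, c]) < rho)]
            if len(members) > 0:
                centers[c] = (members, fit_1svm(gauss(members, members, sigma), C))
        # Step 4 is simplified here: a fixed number of sweeps instead of a change test
    return centers, labels
```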


    Kernel K-Means (Girolami, 2002)

1. Project the data set D into a Feature Space F, by means of a nonlinear mapping Φ. Initialize the centers a_c ∈ F, c = 1, ..., K.

2. Compute for each center a_c its Feature Voronoi set FV_c.

3. Move each center a_i to the mean of its Feature Voronoi set:

$$a_i = \frac{1}{|FV_i|} \sum_{\xi \in FV_i} \Phi(\xi)$$

4. Go to step 2 if any a_c changes, otherwise return the Feature Space codebook.


    Kernel K-Means (cont.)

It works even if we do not know Φ.

We are always able to compute the distance of any point Φ(x) from any centroid a_c. After some maths, we have:

$$\|\Phi(x) - a_c\|^2 = G(x, x) - \frac{2}{|FV_c|} \sum_{x_j \in FV_c} G(x_j, x) + \frac{1}{|FV_c|^2} \sum_{x_i \in FV_c} \sum_{x_j \in FV_c} G(x_i, x_j)$$

Hence even if we do not know Φ we are always able to compute the Feature Voronoi sets.
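The identity above is all that is needed to run Kernel K-Means from a precomputed Gram matrix; the sketch below (NumPy, illustrative, not Girolami's reference implementation) never forms the centroids a_c explicitly.

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=100, seed=0):
    m = K.shape[0]
    labels = np.random.default_rng(seed).integers(0, n_clusters, size=m)
    for _ in range(n_iter):
        dist2 = np.full((m, n_clusters), np.inf)
        for c in range(n_clusters):
            members = np.flatnonzero(labels == c)
            if len(members) == 0:
                continue
            # ||Phi(x) - a_c||^2 = K(x, x) - 2/|FV_c| sum_j K(x, x_j)
            #                      + 1/|FV_c|^2 sum_ij K(x_i, x_j)
            dist2[:, c] = (np.diag(K) - 2.0 * K[:, members].mean(axis=1)
                           + K[np.ix_(members, members)].mean())
        new_labels = np.argmin(dist2, axis=1)
        if np.array_equal(new_labels, labels):
            break                      # no point changed its Feature Voronoi set
        labels = new_labels
    return labels
```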


    Experiments with Camastra-Verri algorithm

    Synthetic Data Set (Delta Set)

    Iris Data(Fisher, 1936)

    Wisconsin breast cancer database

    (Wolberg and Mangasarian, 1990)

    Spam data

K-Means (Lloyd, 1957), Self-Organizing Map (Kohonen, 1982), Neural Gas (Martinetz et al., 1992), the Ng-Jordan algorithm (Ng et al., 2001) and our algorithm have been tried.

Delta Set: K-Means

[Scatter plot of the Delta Set partition obtained by K-Means; axes roughly x ∈ [0, 1], y ∈ [−1, 1].]

Delta Set: Our Algorithm

[Scatter plot of the final Delta Set partition obtained by our algorithm.]

Delta Set: Our Algorithm (iterations I through VI)

[One scatter plot per iteration, showing how the partition evolves over the first six iterations of our algorithm.]

    Iris Data


Iris data is formed by 150 data points of three different classes. One class (setosa) is linearly separable from the other two (versicolor, virginica), but the other two are not linearly separable from each other.

The Iris data dimension is 4.

K-Means, Self-Organizing Map (SOM), Neural Gas, the Ng-Jordan algorithm and our algorithm have been tried.

Experiments have been performed using three codevectors.

Iris Data

[Scatter plot of the Iris data.]

Iris Data: K-Means

[Scatter plot of the Iris partition obtained by K-Means.]

Iris Data: Camastra-Verri algorithm

[Scatter plot of the Iris partition obtained by the Camastra-Verri algorithm.]

    Iris data: Results


model                  Points Classified Correctly
SOM                    121.5 ± 1.5  (81.0%)
K-Means                133.5 ± 0.5  (89.0%)
Neural Gas             137.5 ± 1.5  (91.7%)
Ng-Jordan Algorithm    126.5 ± 7.5  (84.3%)
Our Algorithm          142 ± 1      (94.7%)

Average Ng-Jordan algorithm, SOM, K-Means, Neural Gas and our algorithm performances on the Iris data. The results have been obtained using twenty different runs for each algorithm.

    Wisconsin Data


Wisconsin breast cancer data is formed by 699 patterns (patients) of two different classes. The classes are not linearly separable from each other.

The database considered in the experiments has 683 samples, since we removed 16 patterns with missing values.

The Wisconsin data dimension is 9. K-Means, Self-Organizing Map (SOM), Neural Gas, the Ng-Jordan algorithm and our algorithm have been tried.

Experiments have been performed using two codevectors.

    Wisconsin database: Results


model                  Points Classified Correctly
K-Means                656.5 ± 0.5  (96.1%)
Neural Gas             656.5 ± 0.5  (96.1%)
SOM                    660.5 ± 0.5  (96.7%)
Ng-Jordan Algorithm    652 ± 2      (95.5%)
Our Algorithm          662.5 ± 0.5  (97.0%)

Average Ng-Jordan algorithm, SOM, K-Means, Neural Gas and our algorithm performances on the Wisconsin breast cancer database. The results have been obtained using twenty different runs for each algorithm.

    Spam Data


Spam data is formed by 1534 patterns of two different classes (spam and not-spam). The classes are not linearly separable from each other.

The Spam data dimension is 57.

K-Means, Self-Organizing Map (SOM), Neural Gas, the Ng-Jordan algorithm and our algorithm have been tried.

Experiments have been performed using two codevectors.

    Spam data: Results


model                  Points Classified Correctly
K-Means                1083 ± 153  (70.6%)
Neural Gas             1050 ± 120  (68.4%)
SOM                    1210 ± 30   (78.9%)
Ng-Jordan Algorithm    929 ± 0     (60.6%)
Our Algorithm          1247 ± 3    (81.3%)

Average Ng-Jordan algorithm, SOM, K-Means, Neural Gas and our algorithm performances on the Spam data. The results have been obtained using twenty different runs for each algorithm.

Conclusions and Future Work


Our algorithm performs better than K-Means, SOM, Neural Gas and the Ng-Jordan algorithm on a synthetic data set and three UCI benchmarks (Iris data, Wisconsin Breast Cancer Database, Spam Database).

Future efforts will be devoted to the application of our algorithm to computer vision problems (e.g. color image segmentation).

At present we are investigating how kernel methods can be generalized in terms of fuzzy logic (kernel-fuzzy methods).

Finally, experimental comparisons between our algorithm and Girolami's algorithm are in progress.

    Dedication


To my mother, Antonia Nicoletta Corbascio, in the most difficult moment of her life.