

Agglomerative Hierarchical Clustering Without Reversals on Dendrograms Using Asymmetric Similarity Measures

Satoshi Takumi and Sadaaki Miyamoto
Department of Risk Engineering, School of Systems and Information Engineering, University of Tsukuba
1-1-1 Tennodai, Tsukuba, Ibaraki 305-8577, Japan
E-mail: [email protected]

    [Received November 29, 2011; accepted September 25, 2012]

Algorithms of agglomerative hierarchical clustering using asymmetric similarity measures are studied. Two different measures between two clusters are proposed, one of which generalizes the average linkage for symmetric similarity measures. Asymmetric dendrogram representation is considered after foregoing studies. It is proved that the proposed linkage methods for asymmetric measures have no reversals in the dendrograms. Examples based on real data show how the methods work.

Keywords: agglomerative clustering, asymmetric similarity, asymmetric dendrogram

    1. Introduction

Cluster analysis, also called clustering, has become a standard tool in modern data mining and data analysis. Clustering techniques are divided into two classes: hierarchical and non-hierarchical methods. The major technique in the first class is the well-known agglomerative hierarchical clustering [1, 2], which is old but has been found useful in a variety of applications.

Agglomerative hierarchical clustering uses a similarity or dissimilarity measure between a pair of objects to be clustered, and the similarity/dissimilarity is assumed to be symmetric. In some real applications, however, relations between objects are asymmetric, e.g., citation counts between journals and imports of goods between two countries. In such cases we have a motivation to analyze asymmetric measures and obtain clusters having asymmetric features.

Several studies, though not many, have been done on clustering based on asymmetric similarity measures. Hubert [3] defined clusters using the concept of the connectivity of asymmetric weighted graphs. Okada and Teramoto [4] used the mean of asymmetric measures with an asymmetric dendrogram. Yadohisa [5] studied the generalized linkage method for asymmetric measures with a variation of the asymmetric dendrogram representing two levels on a branch.

We propose two new linkage methods for asymmetric similarity measures in this paper. One method is a generalization of the average linkage for symmetric similarity, and the other is a model-dependent method based on the concept of the average citation probability from one cluster to another. As the asymmetric dendrogram herein, we use a variation of that by Yadohisa [5]. We also prove that the proposed methods have no reversals in the dendrogram. In order to observe how the proposed methods work, we show examples based on three real data sets.

    2. Agglomerative Hierarchical Clustering

We first review the general procedure of agglomerative hierarchical clustering and then introduce asymmetric similarity measures.

Let the set of objects for clustering be X = {x_1, …, x_N}. Generally a cluster, denoted by G_i, is a subset of X. The family of clusters is denoted by

𝒢 = {G_1, G_2, …, G_K},

where the clusters form a crisp partition of X:

\bigcup_{i=1}^{K} G_i = X, \qquad G_i \cap G_j = \emptyset \quad (i \neq j). \qquad (1)

Moreover, the number of objects in G is denoted by |G|.

Agglomerative hierarchical clustering uses a similarity or dissimilarity measure. We use similarity here: the similarity between two objects x, y ∈ X is assumed to be given and is denoted by s(x, y). Similarity between two clusters is also used; it is denoted by s(G, G′) (G, G′ ∈ 𝒢) and is also called an inter-cluster similarity.

In the classical setting a similarity measure is assumed to be symmetric:

s(G, G′) = s(G′, G).

Let us first describe a general procedure of agglomerative hierarchical clustering [6, 7].

AHC (Agglomerative Hierarchical Clustering) Algorithm:

AHC1: Assume that initial clusters are given by 𝒢 = {G_1, G_2, …, G_{N_0}}, where G_1, G_2, …, G_{N_0} are the given initial clusters. Generally G_j = {x_j} ⊆ X, hence N_0 = N.

Vol.16 No.7, 2012, Journal of Advanced Computational Intelligence and Intelligent Informatics, p. 807


Set K = N_0 and let the current clusters be G_i (i = 1, …, K). (K is the number of clusters and N_0 is the initial number of clusters.)

Calculate s(G, G′) for all pairs G, G′ ∈ 𝒢.

AHC2: Search for the pair of maximum similarity:

(G_p, G_q) = \arg\max_{G_i, G_j \in \mathcal{G}} s(G_i, G_j), \qquad (2)

and let

m_K = s(G_p, G_q) = \max_{G_i, G_j \in \mathcal{G}} s(G_i, G_j). \qquad (3)

Merge: G_r = G_p ∪ G_q. Add G_r to 𝒢 and delete G_p, G_q from 𝒢. Set K = K − 1. If K = 1, stop and output the dendrogram.

AHC3: Update the similarities s(G_r, G) and s(G, G_r) for all G ∈ 𝒢.

Go to AHC2.

End AHC.

Note: The calculation of s(G, G_r) in AHC3 is unnecessary when the measure is symmetric: s(G_r, G) = s(G, G_r).
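The AHC procedure above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the single-link rule of Eq. (5) is used in step AHC3, and inter-cluster similarities are recomputed directly from the object-level matrix for clarity rather than via an update formula.

```python
def ahc(S):
    """Sketch of the AHC algorithm (AHC1-AHC3) for a similarity
    matrix S (list of lists, possibly asymmetric).
    Returns the merge history as (members_p, members_q, level m_K)."""
    n = len(S)
    # AHC1: initial singleton clusters G_j = {x_j}, so N_0 = N.
    clusters = {j: [j] for j in range(n)}
    sim = {(i, j): S[i][j] for i in range(n) for j in range(n) if i != j}
    merges = []
    while len(clusters) > 1:
        # AHC2: pair of maximum similarity, Eqs. (2)-(3).
        p, q = max(sim, key=sim.get)
        m_K = sim[(p, q)]
        merges.append((list(clusters[p]), list(clusters[q]), m_K))
        # Merge: G_r = G_p ∪ G_q; delete G_q and every stale pair.
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
        sim = {pair: v for pair, v in sim.items() if q not in pair}
        # AHC3: update s(G_r, G) and s(G, G_r) for all remaining G.
        # Here the single-link rule, Eq. (5), by direct recomputation;
        # both directions are maintained, as the Note above requires.
        for g in clusters:
            if g != p:
                sim[(p, g)] = max(S[x][y] for x in clusters[p] for y in clusters[g])
                sim[(g, p)] = max(S[x][y] for x in clusters[g] for y in clusters[p])
    return merges
```

For a symmetric S this reduces to ordinary single-link clustering; for an asymmetric S the two directed similarities s(G_r, G) and s(G, G_r) generally differ and must both be kept.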

Well-known linkage methods such as the single link, complete link, and average link all assume symmetric dissimilarity measures [1, 2, 6]. In particular, the single link uses the following inter-cluster similarity definition:

s(G, G′) = \max_{x \in G,\, y \in G'} s(x, y). \qquad (4)

When G_p and G_q are merged into G_r, the updating formula in AHC3 for the single link is:

s(G_r, G) = s(G_p \cup G_q, G) = \max\{ s(G_p, G),\, s(G_q, G) \}. \qquad (5)

The average link defines the next inter-cluster similarity:

s(G, G′) = \frac{1}{|G||G'|} \sum_{x \in G,\, y \in G'} s(x, y), \qquad (6)

and the updating formula in AHC3 for the average link is:

s(G_r, G) = s(G_p \cup G_q, G) = \frac{|G_p|}{|G_r|} s(G_p, G) + \frac{|G_q|}{|G_r|} s(G_q, G). \qquad (7)
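The two update rules, Eqs. (5) and (7), can be written as small functions; a sketch with cluster sizes passed in explicitly (the function and parameter names are ours, not from the paper):

```python
def single_link_update(s_pG, s_qG):
    """Eq. (5): s(G_r, G) = max{ s(G_p, G), s(G_q, G) }."""
    return max(s_pG, s_qG)

def average_link_update(s_pG, s_qG, size_p, size_q):
    """Eq. (7): s(G_r, G) = |G_p|/|G_r| * s(G_p, G) + |G_q|/|G_r| * s(G_q, G),
    where |G_r| = |G_p| + |G_q|."""
    size_r = size_p + size_q
    return (size_p / size_r) * s_pG + (size_q / size_r) * s_qG
```

The size weighting in Eq. (7) makes the update equivalent to recomputing Eq. (6) over G_r directly, since the double sum over G_r × G splits into the sums over G_p × G and G_q × G.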

There are other well-known linkage methods, the centroid link and the Ward method, that assume the objects are points in a Euclidean space. They use dissimilarity measures related to the Euclidean distance; for example, the centroid link uses the square of the Euclidean distance between the two centroids of the clusters. In any case, the five linkage methods mentioned above all assume the symmetric property of similarity and dissimilarity measures.

For the single link, complete link, and average link, it is known that we have the monotonicity of m_K:

m_N \geq m_{N-1} \geq \cdots \geq m_2 \geq m_1. \qquad (8)

Fig. 1. Three points A, B, C on a plane.

Fig. 2. A simple example of a reversal.

If the monotonicity does not hold, we have a reversal in a dendrogram: this means that G and G′ are merged into Ĝ = G ∪ G′ at level m = s(G, G′), after that Ĝ and another cluster G″ are merged at level m′ = s(Ĝ, G″), and m′ > m occurs. Reversals in a dendrogram are observed for the centroid method. Consider the next example [6, 7]:

Example 1. Consider three points A, B, C on a plane as in Fig. 1. The two points A and B are the nearest pair, so these two are first made into a cluster. Then the distance between the midpoint (centroid) of A and B and the point C can be smaller than the distance between A and B. We thus have a reversal, as shown in Fig. 2.
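Example 1 can be checked numerically. A small sketch, with coordinates chosen by us (not taken from the paper) so that A and B form the nearest pair:

```python
import math

# Illustrative coordinates: A and B are the nearest pair,
# C sits above their midpoint.
A, B, C = (0.0, 0.0), (2.0, 0.0), (1.0, 1.9)

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

d_AB = dist(A, B)                                     # 2.0: level of the first merge
centroid_AB = ((A[0] + B[0]) / 2, (A[1] + B[1]) / 2)  # midpoint (centroid) of {A, B}
d_cen_C = dist(centroid_AB, C)                        # 1.9: level of the second merge
# d_cen_C < d_AB even though A, B were the nearest pair: the second
# merge happens at a smaller dissimilarity, i.e., a reversal occurs
# for the centroid method.
```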

Clearly, if the monotonicity always holds for a linkage method, no reversals occur in the dendrogram.

Having reviewed the above, we describe how asymmetric similarity is calculated in the next section.

    3. Asymmetric Similarity Measures

We assume hereafter that similarity measures are, in general, asymmetric:

s(G, G′) ≠ s(G′, G).
