A Comprehensive Hierarchical Clustering Method for Gene Expression Data

Baoying Wang, Honglin Li, William Perrizo


Computer Science Department, North Dakota State University
Fargo, ND 58105
Tel: (701) 231-6257, Fax: (701) 231-8255

{baoying.wang, honglin.li, william.perrizo}@ndsu.nodak.edu


Abstract. Data clustering methods have proven to be a successful data mining technique in the analysis of gene expression data. However, some concerns and challenges still remain in gene expression clustering. For example, many traditional clustering methods originating in non-biological fields may break down if the choice of parameters is incorrect, or if the model is not sufficient to capture the genuine clusters among noisy data. In this paper, we propose an efficient comprehensive hierarchical clustering method using attractor trees (CAMP) based on both density and similarity factors. The combination of the density-based and similarity-based approaches takes into consideration clusters with diverse shapes, densities, and sizes. A vertical data structure, the P-tree1, is used to make the clustering process even more efficient by accelerating the calculation of the density function using neighborhood rings. Experiments on common gene expression datasets demonstrate that our approach is more efficient and scalable with competitive accuracy.

Keywords: gene expression data, hierarchical clustering, P-trees, microarray.

1 Introduction

Clustering in data mining is a discovery process that partitions the data set into groups such that the data points in the same group are more similar to each other than to the data points in other groups. Clustering analysis of microarray gene expression data, which discovers groups that are homogeneous and well separated, has been recognized as an effective method for gene expression analysis.

Eisen et al. first applied a hierarchical linkage clustering approach that groups the closest pairs into a hierarchy of nested subsets based on similarity [5]. Golub et al. also successfully discovered tumor classes based on the simultaneous expression profiles of thousands of genes from acute leukemia patients' test samples, using a self-organizing map clustering approach [7]. Other clustering approaches, such as k-means [14], fuzzy k-means [1], and CAST [2], have also proven to be valuable clustering methods for gene expression data analysis. However, some concerns and challenges still remain in gene expression clustering. For example, many traditional clustering methods originating in non-biological fields may break down if the choice of parameters is incorrect, or if the model is not sufficient to capture the genuine clusters among noisy data.

There are two major clustering categories: similarity-based partitioning methods and density-based clustering methods. A similarity-based partitioning algorithm breaks a dataset into k subsets, which are assumed to be convex and of similar size. Density-based clustering assumes that all points within each cluster are density reachable, and that all clusters have similar densities. Most hierarchical clustering methods are based on similarity-based partitioning algorithms. As a result, they can only handle clusters of convex shape and similar size. In this paper, we propose an efficient comprehensive agglomerative hierarchical clustering method using attractor trees (CAMP). CAMP combines the features of both the density-based and similarity-based clustering approaches, and thus takes into consideration clusters with diverse shapes, densities, and sizes.

A vertical data structure, the P-tree, is used to make the algorithm more efficient by accelerating the calculation of the density function. P-trees are also used as bit indexes to clusters. In the merging process, only summary information of the attractor sub-trees is used to find the closest cluster pair. When two clusters need to be merged, their P-tree indexes are retrieved for merging. The clustering results are an attractor tree and a collection of P-tree indexes to clusters corresponding to each level of the attractor tree. Experiments on common gene expression datasets demonstrate that our approach is more efficient and scalable with competitive accuracy.

This paper is organized as follows. In section 2 we give an overview of the related work. We present our new clustering method, CAMP, in section 3. Section 4 discusses the implementation of CAMP using P-trees. An experimental performance study is described in section 5. Finally we conclude the paper in section 6.

2 Related Work

2.1 Similarity-based clustering vs. density-based clustering

There are two major clustering categories: similarity-based partitioning methods and density-based clustering methods. A similarity-based partitioning algorithm breaks a dataset into k subsets, called clusters. The major problems with similarity-based partitioning methods are: (1) k has to be predetermined; (2) it is difficult to identify clusters with different sizes; (3) only convex clusters are found. Density-based clustering methods have been developed to discover clusters with arbitrary shapes. The most typical algorithm is DBSCAN [6]. The basic idea of DBSCAN is that, for each point of a cluster, the neighborhood of a given radius (Eps) has to contain at least a minimum number of points (MinPts), where Eps and MinPts are input parameters.

2.2 Hierarchical clustering algorithms

Hierarchical algorithms create a hierarchical decomposition of a dataset X. The hierarchical decomposition is represented by a dendrogram, a tree that iteratively splits X into smaller subsets until each subset consists of only one object. In such a hierarchy, each level of the tree represents a clustering of X.

1 Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308.

Most hierarchical clustering algorithms are variants of the single-link and complete-link approaches. In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of points drawn from the two clusters. In the complete-link algorithm, the distance between two clusters is the maximum of all pairwise distances between points in the two clusters. In either case, two clusters are merged to form a larger cluster based on a minimum distance (or maximum similarity) criterion. The complete-link algorithm produces tightly bound or compact clusters, while the single-link algorithm suffers when there is a chain of noise points between two clusters.
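To make the two linkage criteria concrete, here is a small C++ sketch (not from the paper) that computes single-link and complete-link distances between two clusters under Euclidean distance; the Point type and the representation of a cluster as a list of points are illustrative assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// A point is a vector of expression values (illustrative assumption).
using Point = std::vector<double>;

static double euclidean(const Point& a, const Point& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Single-link: minimum pairwise distance between the two clusters.
double singleLink(const std::vector<Point>& c1, const std::vector<Point>& c2) {
    double best = std::numeric_limits<double>::max();
    for (const auto& p : c1)
        for (const auto& q : c2) best = std::min(best, euclidean(p, q));
    return best;
}

// Complete-link: maximum pairwise distance between the two clusters.
double completeLink(const std::vector<Point>& c1, const std::vector<Point>& c2) {
    double worst = 0.0;
    for (const auto& p : c1)
        for (const auto& q : c2) worst = std::max(worst, euclidean(p, q));
    return worst;
}
```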

CHAMELEON [10] is a variant of the complete-link approach. It operates on a k-nearest-neighbor graph. The algorithm first uses a graph partitioning approach to divide the dataset into a set of small clusters. Then the small clusters are merged based on their similarity measure. CHAMELEON has been found to be very effective in clustering convex shapes. However, the algorithm is not designed for very noisy data sets.

When the dataset size is large, hierarchical clustering algorithms break down due to their non-linear time complexity and huge I/O costs. To remedy this problem, BIRCH [16] was developed. BIRCH performs a linear scan of all data points, and cluster summaries are stored in memory in a data structure called a CF-tree. A non-leaf node represents a cluster consisting of all the subclusters represented by its entries. A leaf node can contain at most L entries, and the diameter of each entry in a leaf node has to be less than T. A point is inserted by inserting the corresponding CF-value into the closest leaf of the tree. If an entry in the leaf can absorb the new point without violating the threshold condition, the CF-values for this entry are updated; otherwise a new entry in the leaf node is created. Once the clusters are generated, each data point is assigned to the cluster with the closest centroid. Such label assignment may be problematic when the clusters do not have similar sizes and shapes.

2.3 Clustering methods of gene expression data

There are many newly developed clustering methods which are dedicated to gene expression data. These clustering algorithms partition genes into groups of co-expressed genes.

Eisen et al. [5] adopted a hierarchical approach using UPGMA (Unweighted Pair Group Method with Arithmetic Mean) to group the closest gene pairs. This method displays the clustering results as a colored graph pattern: the gene expression data is colored according to the measured fluorescence ratio, and genes are re-ordered based on the hierarchical dendrogram structure.

Ben-Dor et al. [2] proposed a graph-based algorithm, CAST (Cluster Affinity Search Technique). Two points are linked in the graph if they are similar. The problem of clustering a set of genes is then converted into a classical graph-theoretic problem. CAST takes as input a parameter called the affinity threshold t, where 0 < t < 1, and tries to guarantee that the average similarity in each generated cluster is higher than the threshold t.

Hartuv et al. [9] presented an algorithm called HCS (Highly Connected Subgraph). HCS recursively splits the weighted graph into a set of highly connected sub-graphs along the minimum cut. Each highly connected sub-graph is called a cluster. Later on, the same research group developed another algorithm, CLICK (Cluster Identification via Connectivity Kernels) [13]. CLICK builds a statistical framework to measure the coherence within a subset of genes and determines the criterion for stopping the recursive splitting process.

3 Agglomerative Hierarchical Clustering Using Attractor Trees

In this section, we propose a comprehensive agglomerative hierarchical clustering method using attractor trees (CAMP). CAMP consists of two processes: (1) clustering by local attractor trees (CLA) and (2) cluster merging based on similarity (MP). The final clustering result is an attraction tree and a set of P-tree indexes to clusters corresponding to each level of the attractor tree. The attraction tree is composed of leaf nodes, which are the local attractors of the attractor sub-trees constructed in the CLA process, and interior nodes, which are virtual attractors resulting from the MP process. Figure 1 is an example of an attraction tree.

Figure 1. The attraction tree

The data set is first grouped into local attractor trees by means of a density-based approach in the CLA process. Each local attractor tree represents a preliminary cluster, the root of which is a density attractor of the cluster. Then the small clusters are merged level by level in the MP process according to their similarity until the whole data set becomes one cluster.

In this section, we first define the density function of data points and describe the detailed clustering process of CLA. Then we define the similarity function between clusters and propose the algorithm for the cluster merging process (MP). Finally, we discuss our noise handling technique.


3.1 Density Function

Given a data point x, the density function of x is defined as the sum of the influence functions of all data points in the data space X. There are many ways to calculate the influence function. The influence of a data point on x is inversely proportional to the distance between that point and x. If we divide the neighborhood of x into neighborhood rings, then points within smaller rings have more influence on x than those in bigger rings. We define the neighborhood ring as follows:

Definition 1. The Neighborhood Ring of a data point c with radii r1 and r2 is defined as the set R(c, r1, r2) = {x ∈ X | r1 < |c − x| ≤ r2}, where |c − x| is the distance between x and c. The number of neighbors falling in R(c, r1, r2) is denoted as N = ||R(c, r1, r2)||.

Definition 2. The Equal Interval Neighborhood Ring (EINring) of a data point c with radii r1 = kε and r2 = (k+1)ε is defined as the kth neighborhood ring EINring(c, k, ε) = R(c, r1, r2) = R(c, kε, (k+1)ε), where ε is a constant interval. Figure 2 shows 2-D EINrings with k = 1, 2, and 3. The number of neighbors falling within the kth EINring is denoted as ||EINring(c, k, ε)||.

Let y be a data point within the kth EINring of x. The EINring-based influence function of y on x is defined as:

f(y, x) = fk(x) = 1 / log k    (1)

The density function of x is defined as the summation of the influences within every EINring neighborhood of x:

DF(x) = Σk ||EINring(x, k, ε)|| · fk(x)    (2)

Figure 2. Diagram of EINrings.
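As a concrete illustration of Definitions 1 and 2, the brute-force C++ sketch below counts EINring neighbors and accumulates DF(x). The weight ringWeight(k) = 1/log(k + 2) is an assumption standing in for the influence weight of equation (1), shifted so that the innermost ring gets a finite weight, and the O(n^2) distance scan is for clarity only; Section 4 replaces it with P-tree operations.

```cpp
#include <cmath>
#include <vector>

using Point = std::vector<double>;

static double euclidean(const Point& a, const Point& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Influence of a point lying in the k-th EINring (k = 0, 1, 2, ...).
// Assumption: 1/log(k + 2) as a finite stand-in for the paper's 1/log k.
static double ringWeight(int k) { return 1.0 / std::log(k + 2.0); }

// DF(x) = sum over rings k of ||EINring(x, k, eps)|| * ringWeight(k).
double densityFunction(const Point& x, const std::vector<Point>& data,
                       double eps, int maxRings) {
    std::vector<int> ringCount(maxRings, 0);
    for (const auto& y : data) {
        double d = euclidean(x, y);
        if (d == 0.0) continue;              // skip x itself
        int k = static_cast<int>(d / eps);   // ring index (boundary approx.)
        if (k < maxRings) ++ringCount[k];
    }
    double df = 0.0;
    for (int k = 0; k < maxRings; ++k) df += ringCount[k] * ringWeight(k);
    return df;
}
```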

3.2 Clustering by Local Attractor Trees

The basic idea of clustering by local attractor trees (CLA) is to partition the data set into clusters in terms of density attractor trees. The clusters can be density-connected areas of any shape and any size. Given a data point x, if we follow the steepest density ascending path, the path will finally lead to a local density attractor. If x does not have such a path, it is either a local attractor or noise. All points whose steepest ascending paths lead to the same local attractor form a cluster. The resulting graph is a collection of local attractor trees, each with its local attractor as the root. The leaves are the boundary points of clusters. An example of a dataset and its attractor trees is shown in Figure 3.

Figure 3. A dataset and the attractor trees

Given the step size s and the EINring interval ε, the CLA clustering proceeds as follows (a code sketch of these steps follows the list):

1. Compute the density function for each point.
2. For an arbitrary point x, find the point with the highest density in the neighborhood R(x, 0, s). If that density is higher than the density of x, build a directed edge from x to that point.
3. If none of the neighbors has a higher density than x, x becomes the root of a local attractor tree and is assigned a new cluster label.
4. Go back to step 2 with the next point.
5. The data points in each attractor tree are assigned the same cluster label as the attractor/root.
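Below is a minimal C++ sketch of these steps under simplifying assumptions: densities are precomputed (for example with the density function of Section 3.1), the neighborhood search over R(x, 0, s) is a brute-force scan, and each attractor tree is stored as a parent array. It illustrates the CLA idea rather than the paper's P-tree-based implementation.

```cpp
#include <cmath>
#include <vector>

using Point = std::vector<double>;

static double euclidean(const Point& a, const Point& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Clustering by Local Attractor trees (CLA), brute-force sketch.
// parent[i] = index of the higher-density neighbor that i points to,
// or i itself when i is a local attractor (root). label[i] = cluster id.
void claClustering(const std::vector<Point>& data,
                   const std::vector<double>& density, double stepSize,
                   std::vector<int>& parent, std::vector<int>& label) {
    const int n = static_cast<int>(data.size());
    parent.assign(n, -1);
    label.assign(n, -1);
    // Steps 2-4: link each point to the densest neighbor within R(x, 0, s).
    for (int i = 0; i < n; ++i) {
        int best = i;
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            if (euclidean(data[i], data[j]) <= stepSize &&
                density[j] > density[best]) best = j;
        }
        parent[i] = best;                 // best == i means i is an attractor
    }
    // Steps 3 and 5: roots get fresh labels; other points inherit the root's label.
    int nextLabel = 0;
    for (int i = 0; i < n; ++i)
        if (parent[i] == i) label[i] = nextLabel++;
    for (int i = 0; i < n; ++i) {
        int r = i;
        while (parent[r] != r) r = parent[r];   // follow the ascending path
        label[i] = label[r];
    }
}
```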

3.3 Similarity between Clusters

There are many similarity measures for finding the most similar cluster pairs. CURE uses the similarity between the closest pair of points belonging to different clusters [8]. CHAMELEON uses both relative connectivity and relative closeness through complicated graph implementations. We also consider both relative connectivity and relative closeness, but combine them by means of attractor trees. We define the similarity between clusters i and j as follows:

CS(i, j) = …    (3)

where hi is the average height of the ith attractor tree, fi is the average fan-out of the ith attractor tree, and d(Ai, Aj) is the distance between the two local attractors Ai and Aj. The calculations of hi and fi are discussed later.

Cluster similarity depends on the following factors: the distance between attractors of two clusters, the average heights and average fan-outs of two clusters. Our cluster similarity function can distinguish the cases shown in Figure 4. The cluster pairs on the left have higher similarity than those on the right according to our definition.


(a) Different average heights

(b) Different average fan-outs

(c) Different d(Ai, Aj)

Figure 4. Examples of cluster similarity

3.4 Cluster Merging Process

After the local attractor trees (sub-clusters) are built in the CLA process, the cluster merging process (MP) starts combining the most similar sub-cluster pairs level by level based on the similarity measure. When two clusters are merged, their two local attractor trees are combined into a new tree, called a virtual local attractor tree. It is called "virtual" because the new root is not a real attractor; it is only a virtual attractor that could attract all points of the two sub-trees. The merging process is shown in Figure 5. Cluster merging proceeds recursively by combining (virtual) attractor trees.

(a) Before merging (b) After merging

Figure 5. Cluster merging process

After merging, we need to compute the new root/attractor Av, the average height hv, and the average fan-out fv of the new virtual attractor tree. Take two clusters Ci and Cj, for example, and assume the size of Cj is greater than or equal to that of Ci, i.e., ||Cj|| ≥ ||Ci||. We have the following equations:

Avl = …,    l = 1, 2, …, d    (4)

hv = Max{hi, hj} + …    (5)

fv = …    (6)

where Ail is the lth attribute of the attractor Ai, ||Ci|| is the size of cluster Ci, d(Ai, Aj) is the distance between the two local attractors Ai and Aj, and hi and fi are the average height and the average fan-out of the ith attractor tree, respectively.

3.5 Delayed Noise Eliminating Process

Since gene expression data is highly noisy, it is important to have a proper noise eliminating process. Naively, the points that stand alone after the CLA process should be noise. However, some sparse clusters might be mistakenly eliminated if the step size s is small. Therefore, we delay the noise handling process until a later stage. The neighborhoods of noise points are generally sparser than those of points in clusters. In the cluster merging process, noise points tend to merge with other points much less often and grow much more slowly. Therefore, the cluster merging process is tracked to capture those clusters that are growing very slowly. For a slowly growing cluster, if the cluster is small, the whole cluster is eliminated as noise; otherwise, if the cluster is large, the points that were recently merged into it at a low rate are peeled off.

4 Implementation of CAMP in P-trees

CAMP is implemented using the data-mining-ready vertical bitwise data structure, the P-tree, to make the clustering process much more efficient and scalable. The P-tree technology was initially developed by the DataSURG research group for spatial data [12] [4]. P-trees provide rich summary information and are structured to facilitate data mining processes. In this section, we first briefly discuss the representation of a gene dataset in the P-tree structure and P-tree-based neighborhood computation. Then we detail the implementation of CAMP using P-trees.

4.1 Data Representation

We organize the gene expression data as a relational table with rows of genes and columns of experiments or time points. Instead of using double-precision floating-point numbers with a mantissa and an exponent represented in two's complement, we partition the gene expression data space as follows. First, we decide the number of intervals and specify the range of each interval. For example, we could partition the gene expression data space into 256 equal intervals along each dimension. We then replace each gene value by the bit string of its interval, using strings from 00000000 to 11111111 to represent the 256 intervals. The length of the bit string is the base-two logarithm of the number of intervals. The optimal number of intervals and their ranges depend on the size of the dataset and the accuracy requirements.

Given a gene table G = (E1, E2, …, Ed) and the binary representation of the jth attribute Ej as bj,m bj,m-1 … bj,i … bj,1 bj,0, the table is first projected into columns, one for each attribute. Each attribute column is then decomposed into separate bit vectors, one for each bit position of the values in that attribute. Figure 6 shows a relational table with three attributes. Figure 7 shows the decomposition process from the gene table G to a set of bit vectors.

Figure 6. An example of gene table.

Figure 7. Decomposition of the gene table
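To make this preprocessing concrete, the following C++ sketch quantizes one attribute column into 256 equal intervals and splits the resulting 8-bit codes into vertical bit vectors, one per bit position, as in Figure 7. The linear scaling over a known range [lo, hi] is an assumption; the paper only requires some partition of each dimension into intervals.

```cpp
#include <cstdint>
#include <vector>

// Quantize one attribute column into 8-bit interval codes (256 equal
// intervals over [lo, hi]); assumption: the range [lo, hi] is known.
std::vector<uint8_t> quantize(const std::vector<double>& column,
                              double lo, double hi) {
    std::vector<uint8_t> codes;
    codes.reserve(column.size());
    for (double v : column) {
        double t = (v - lo) / (hi - lo);            // scale to [0, 1]
        int code = static_cast<int>(t * 256.0);
        if (code < 0) code = 0;
        if (code > 255) code = 255;
        codes.push_back(static_cast<uint8_t>(code));
    }
    return codes;
}

// Decompose the coded column into 8 vertical bit vectors, one per bit
// position (index 0 = most significant bit), as in Figure 7.
std::vector<std::vector<bool>> decompose(const std::vector<uint8_t>& codes) {
    std::vector<std::vector<bool>> bitVectors(8, std::vector<bool>(codes.size()));
    for (size_t row = 0; row < codes.size(); ++row)
        for (int bit = 0; bit < 8; ++bit)
            bitVectors[7 - bit][row] = (codes[row] >> bit) & 1;
    return bitVectors;
}
```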

After the decomposition process, each bit vector is converted into a P-tree. A P-tree is built by recording the truth of the predicate "purely 1-bits" recursively on halves of the bit vector until purity is reached. Three P-tree examples are illustrated in Figure 8.

(a) P21    (b) P22    (c) P23

Figure 8. P-trees of attributes E21, E22 and E23

The P-tree logical operations are pruned bitwise operations, performed level by level starting from the root level. For instance, ANDing a pure-0 node with any node results in a pure-0 node, and ORing a pure-1 node with any node results in a pure-1 node.
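The following is a minimal sketch of this construction and of a pruned AND, assuming an uncompressed std::vector<bool> bit vector as input and a small three-way node (pure-0, pure-1, mixed). It illustrates the idea of halving until purity and of pruning on pure nodes, not the actual P-tree implementation of [4] and [12].

```cpp
#include <memory>
#include <vector>

// Simplified P-tree node: a half-interval of the bit vector is either
// purely 1, purely 0, or mixed (in which case it has two children).
struct PTree {
    enum Kind { Pure0, Pure1, Mixed } kind;
    std::unique_ptr<PTree> left, right;
};

// Build a P-tree over bits[lo, hi) by halving until purity (assumes hi > lo).
std::unique_ptr<PTree> build(const std::vector<bool>& bits, size_t lo, size_t hi) {
    auto node = std::make_unique<PTree>();
    bool any1 = false, any0 = false;
    for (size_t i = lo; i < hi; ++i) { if (bits[i]) any1 = true; else any0 = true; }
    if (!any0) { node->kind = PTree::Pure1; return node; }
    if (!any1) { node->kind = PTree::Pure0; return node; }
    node->kind = PTree::Mixed;
    size_t mid = lo + (hi - lo) / 2;
    node->left = build(bits, lo, mid);
    node->right = build(bits, mid, hi);
    return node;
}

std::unique_ptr<PTree> copy(const PTree& t) {
    auto node = std::make_unique<PTree>();
    node->kind = t.kind;
    if (t.kind == PTree::Mixed) { node->left = copy(*t.left); node->right = copy(*t.right); }
    return node;
}

// Pruned AND: a pure-0 operand yields pure-0 without visiting children;
// a pure-1 operand yields a copy of the other operand.
std::unique_ptr<PTree> ptAnd(const PTree& a, const PTree& b) {
    if (a.kind == PTree::Pure0 || b.kind == PTree::Pure0) {
        auto node = std::make_unique<PTree>(); node->kind = PTree::Pure0; return node;
    }
    if (a.kind == PTree::Pure1) return copy(b);
    if (b.kind == PTree::Pure1) return copy(a);
    auto node = std::make_unique<PTree>();
    node->kind = PTree::Mixed;
    node->left = ptAnd(*a.left, *b.left);
    node->right = ptAnd(*a.right, *b.right);
    return node;
}
```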

4.2 P-tree based neighborhood computation

The major computational cost of CAMP is in the preliminary clustering process, CLA, which mainly involves computation of densities. To improve the efficiency of density computation, we adopt the P-tree based neighborhood computation by means of the optimized P-tree operations. In this section, we first review the optimized P-tree operations. Then we present the P-tree based neighborhood computation.

P-tree predicate operations: Let A be the jth dimension of the data set X, m be its bit-width, and Pm, Pm-1, …, P0 be the P-trees for the vertical bit files of A. Let c = bm…bi…b0, where bi is the ith binary bit value of c. Let PA>c and PA≤c be the P-trees representing the data points satisfying the predicates A > c and A ≤ c respectively. Then we have

PA>c = Pm opm … Pi opi Pi-1 … opk+1 Pk,    k ≤ i ≤ m    (7)

where opi is ∧ (AND) if bi = 1 and ∨ (OR) otherwise, and

PA≤c = P'm opm … P'i opi P'i-1 … opk+1 P'k,    k ≤ i ≤ m    (8)

where opi is ∧ (AND) if bi = 0 and ∨ (OR) otherwise. In the equations above, k is the rightmost bit position with value "0", and the operators are right binding.

Calculation of the neighborhood: Let Pc,r be the P-tree representing the data points within the neighborhood R(c, 0, r) = {x ∈ X | 0 < |c − x| ≤ r}. Note that Pc,r is just the P-tree representing the data points satisfying the predicate c − r < x ≤ c + r. Therefore

Pc,r = Pc-r<x≤c+r = Px>c-r ∧ Px≤c+r    (9)


where Px>c-r and Px≤c+r are calculated by means of the P-tree predicate operations above.

Calculation of the EINring neighborhood: Let Pc,k be the P-tree representing the data points within EINring(c, k, ε) = {x ∈ X | kε < |c − x| ≤ (k+1)ε}. In fact, the EINring(c, k, ε) neighborhood is the intersection of R(c, 0, (k+1)ε) with the complement of R(c, 0, kε). Hence

Pc,k = Pc,(k+1)ε ∧ P'c,kε    (10)

where P'c,kε is the complement of Pc,kε.

The count of 1's in Pc,k, denoted ||Pc,k||, is the number of data points within the EINring neighborhood, i.e., ||EINring(c, k, ε)|| = ||Pc,k||. Each 1 in Pc,k indicates a specific neighbor point.

4.3 Implementation of CAMP using P-trees

CAMP consists of two steps: clustering by local attractor trees (CLA) and the cluster merging process (MP). The critical issues are the computation of the density function and the similarity function, and the manipulation of the (virtual) attractor trees during the clustering process. In fact, the similarity function is easy to compute given the summary information of two attractor trees. In this section, we therefore focus on the density function and the attractor trees.

Computation of the density function (in the CLA process): According to equations (2) and (10), the density function is calculated using P-trees as follows:

DF(x) = Σk ||Px,k|| · fk(x)    (11)

Structure of a (virtual) attractor tree: An attractor tree consists of two parts: (1) a collection of summary data, such as the size of the tree, the attractor, the average height, and the average fan-out; and (2) a P-tree used as an index to the points in the attractor tree. In the merging process, only the first part needs to be in memory; the second part is needed only at the time of merging. In addition, a lookup table is used to record the level of each point in the attractor tree. The lookup table is used only in the initial clustering process.

Here is an example of an index P-tree. Assume the dataset size is 8, and an attractor tree contains the first four points and the sixth point in the data set. The corresponding bit index is (11110100), which is converted into a P-tree as shown in Figure 9.

Figure 9. The P-tree for an attractor tree.

Creating an attractor tree (in the CLA process): When a steepest ascending path (SAP) from a point stops at a new local maximum, we create a new attractor tree. The stopping point is the attractor. The average height is set to h = Ns / 2, where Ns is the number of steps in the path, and the average fan-out to f = 1. The corresponding index P-tree is built.

Updating an attractor tree (in the CLA process): If the SAP encounters a point already in an attractor tree, the whole SAP is inserted into that attractor tree. As a result, the attractor tree needs to be updated. The attractor itself does not change. The new average height and the new average fan-out are calculated as follows:

h' = …    (12)

f' = …    (13)

where Nold is the size of the old attractor tree, m is the number of points added to the tree, l is the level of the insertion point, and Nint is the number of interior nodes of the old attractor tree.

Merging two attractor trees (in the MP process): When two attractor trees are combined into a new virtual attractor tree, the summary data are computed using equations (4)–(6). A new index P-tree is formed simply by ORing the two old P-trees, i.e., Pv = Pi ∨ Pj.
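The sketch below shows the data movement of this merging step, with std::vector<bool> indexes standing in for the index P-trees: the new index is the OR of the two old indexes, as stated above, while the summary fields (attractor, average height, average fan-out) are combined by simple size-weighted placeholders, since the exact combination rules are those of equations (4)–(6).

```cpp
#include <algorithm>
#include <vector>

// Summary information of a (virtual) attractor tree plus its bit index
// (a std::vector<bool> stands in for the index P-tree).
struct AttractorTree {
    std::vector<double> attractor;   // A_i
    double avgHeight = 0.0;          // h_i
    double avgFanout = 0.0;          // f_i
    size_t size = 0;                 // ||C_i||
    std::vector<bool> index;         // which rows belong to the cluster
};

// Merge two attractor trees into a virtual one. The index is the OR of the
// two old indexes (as in the paper); the summary fields below use simple
// size-weighted placeholders where the paper applies equations (4)-(6).
AttractorTree merge(const AttractorTree& a, const AttractorTree& b) {
    AttractorTree v;
    v.size = a.size + b.size;
    // Placeholder: size-weighted average of the two attractors.
    v.attractor.resize(a.attractor.size());
    for (size_t l = 0; l < v.attractor.size(); ++l)
        v.attractor[l] = (a.attractor[l] * a.size + b.attractor[l] * b.size) / v.size;
    // Placeholder summary combination (the paper uses equations (5) and (6)).
    v.avgHeight = std::max(a.avgHeight, b.avgHeight) + 1.0;
    v.avgFanout = (a.avgFanout * a.size + b.avgFanout * b.size) / v.size;
    // P_v = P_i OR P_j.
    v.index.resize(a.index.size());
    for (size_t i = 0; i < a.index.size(); ++i) v.index[i] = a.index[i] || b.index[i];
    return v;
}
```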

5 Performance Study

To evaluate the efficiency, accuracy, and robustness of our approach (CAMP), we used three microarray expression datasets: DS1, DS2, and DS3. DS1 is the dataset used by CLICK [13]. It contains expression levels of 8,613 human genes measured at 12 time points. DS2 and DS3 were obtained from Michael Eisen's lab [11]. DS2 is a gene expression matrix of 6,221 × 80. DS3 is the largest dataset, with 13,413 genes under 36 experimental conditions. The raw expression data was first normalized [2]. The datasets were then decomposed and converted to P-trees. We implemented the k-means [14], BIRCH, CAST [2], and CAMP algorithms in C++ on a Debian Linux 3.0 PC with a 1 GHz Pentium CPU and 1 GB of main memory.

The total run times of the different algorithms on DS1, DS2, and DS3 are shown in Figure 10. Note that our approach outperformed k-means, BIRCH, and CAST substantially when the dataset is large. In particular, our approach ran almost four times faster than k-means on DS3.

Figure 10. Run time (s) comparisons of k-means, BIRCH, CAST, and CAMP on DS1, DS2, and DS3

The clustering results are evaluated by means of Hubert's Γ statistic [15]. Given two matrices X = [X(i, j)] and Y = [Y(i, j)], where X(i, j) is the similarity between genes i and j, and Y(i, j) = 1 if genes i and j fall into the same cluster and 0 otherwise, Hubert's Γ statistic indicates the point serial correlation between the two matrices X and Y and is computed as

Γ = [ (1/M) Σi<j (X(i, j) − mX)(Y(i, j) − mY) ] / (sX sY)

where mX, mY and sX, sY are the means and standard deviations of the entries X(i, j) and Y(i, j) over all gene pairs, M = n (n − 1) / 2, and Γ lies in [−1, 1]. Γ is used to measure the correlation between the similarity matrix X and the adjacency matrix of the clustering results.
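A direct C++ sketch of this evaluation measure follows; it assumes X is supplied as a dense n × n similarity matrix and the clustering as one integer label per gene, and computes Γ as the Pearson (point serial) correlation over all gene pairs, the normalized form described above.

```cpp
#include <cmath>
#include <vector>

// Normalized Hubert Gamma statistic: the Pearson correlation between the
// off-diagonal entries of a similarity matrix X and the cluster adjacency
// matrix Y (Y(i,j) = 1 iff genes i and j share a cluster label).
double hubertGamma(const std::vector<std::vector<double>>& X,
                   const std::vector<int>& label) {
    const size_t n = label.size();
    if (n < 2) return 0.0;
    const double M = n * (n - 1) / 2.0;
    // Means of X(i,j) and Y(i,j) over all pairs i < j.
    double mX = 0.0, mY = 0.0;
    for (size_t i = 0; i + 1 < n; ++i)
        for (size_t j = i + 1; j < n; ++j) {
            mX += X[i][j];
            mY += (label[i] == label[j]) ? 1.0 : 0.0;
        }
    mX /= M; mY /= M;
    // Covariance and variances over the same pairs.
    double cov = 0.0, vX = 0.0, vY = 0.0;
    for (size_t i = 0; i + 1 < n; ++i)
        for (size_t j = i + 1; j < n; ++j) {
            double dx = X[i][j] - mX;
            double dy = ((label[i] == label[j]) ? 1.0 : 0.0) - mY;
            cov += dx * dy; vX += dx * dx; vY += dy * dy;
        }
    if (vX == 0.0 || vY == 0.0) return 0.0;   // degenerate cases
    return cov / std::sqrt(vX * vY);          // Gamma lies in [-1, 1]
}
```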

The best clustering qualities of the different methods on DS1, DS2, and DS3 are shown in Figure 11. From Figure 11, it is clear that our approach and CAST produce better clustering results than the other two methods. On DS1, our approach gives better results than CAST.

          DS1     DS2     DS3
K-means   0.446   0.302   0.285
BIRCH     0.510   0.504   0.435
CAST      0.678   0.685   0.727
CAMP      0.785   0.625   0.667

Figure 11. Γ value comparisons of the four methods on DS1, DS2, and DS3

In summary, CAMP outperforms the other methods in terms of execution time, with high scalability. Our clustering results are almost as good as those of CAST, which, however, is not scalable to large datasets.

6 Conclusion

In this paper, we have proposed an efficient comprehensive hierarchical clustering method using attractor trees, CAMP, which combines the features of both the density-based and similarity-based clustering approaches. This combination takes into consideration clusters with diverse shapes, densities, and sizes. A vertical data structure, the P-tree, and optimized P-tree operations are used to make the algorithm more efficient by accelerating the calculation of the density function. Experiments on common gene expression datasets demonstrated that our approach is more efficient and scalable with competitive accuracy. As a result, our approach can be a powerful tool for gene expression data analysis.

In the future, we will apply our approach to large-scale time series gene expression data, where efficient and scalable analysis approaches are in demand. We will also work on post-cluster analysis and result interpretation. For example, we will explore building Bayesian networks to model the potential pathway for each discovered cluster and subcluster.

References

1. Arima, C. and Hanai, T. "Gene Expression Analysis Using Fuzzy K-Means Clustering", Genome Informatics, 14, pp. 334-335, 2003.

2. Ben-Dor, A., Shamir, R. and Yakhini, Z. "Clustering gene expression patterns", Journal of Computational Biology, Vol. 6, pp. 281-297, 1999.


3. Cho, R. J. et al. "A Genome-Wide Transcriptional Analysis of the Mitotic Cell Cycle", Molecular Cell, 2:65-73, 1998.

4. Ding, Q., Khan, M., Roy, A. and Perrizo, W. "The P-Tree Algebra", ACM SAC, 2002.

5. Eisen, M. B., Spellman, P. T. et al. "Cluster analysis and display of genome-wide expression patterns", Proceedings of the National Academy of Sciences USA, pp. 14863-14868, 1998.

6. Ester, M., Kriegel, H.-P., Sander, J. and Xu, X. "A density-based algorithm for discovering clusters in large spatial databases with noise", In Proceedings of the 2nd ACM SIGKDD, Portland, Oregon, pp. 226-231, 1996.

7. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M. et al. "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring", Science, 286, pp. 531-537, 1999.

8. Guha, S., Rastogi, R. and Shim, K. "CURE: An efficient clustering algorithm for large databases", In SIGMOD '98, Seattle, Washington, 1998.

9. Hartuv, E. and Shamir, R. "A clustering algorithm based on graph connectivity", Information Processing Letters, 76(4-6):175-181, 2000.

10. Karypis, G., Han, E.-H. and Kumar, V. "CHAMELEON: A hierarchical clustering algorithm using dynamic modeling", IEEE Computer, 32(8):68-75, August 1999.

11. Michael Eisen's gene expression data is available at http://rana.lbl.gov/EisenData.htm

12. Perrizo, W. "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.

13. Shamir, R. and Sharan, R. "CLICK: A clustering algorithm for gene expression analysis", In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB '00), AAAI Press, 2000.

14. Tavazoie, S., Hughes, J. D. et al. "Systematic determination of genetic network architecture", Nature Genetics, 22, pp. 281-285, 1999.

15. Tseng, V. S. and Kao, C. "An Efficient Approach to Identifying and Validating Clusters in Multivariate Datasets with Applications in Gene Expression Analysis", Journal of Information Science and Engineering, Vol. 20, No. 4, pp. 665-677, 2004.

16. Zhang, T., Ramakrishnan, R. and Livny, M. "BIRCH: An efficient data clustering method for very large databases", In Proceedings of the ACM SIGMOD Int'l Conf. on Management of Data, 1996.