Incorporating User Provided Constraints into Document Clustering
Yanhua Chen, Manjeet Rege, Ming Dong, Jing Hua, Farshad Fotouhi
Department of Computer Science
Wayne State University
Detroit, MI 48202
{chenyanh, rege, mdong, jinghua, fotouhi}@wayne.edu
Outline
• Introduction
• Overview of related work
• Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
• Theoretical result for SS-NMF
• Experiments and results
• Conclusion
What is clustering?
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
• Inter-cluster distances are maximized; intra-cluster distances are minimized
Document Clustering
• Grouping of text documents into meaningful clusters in an unsupervised manner.
Government
Science
Arts
Unsupervised Clustering Example
[Scatter plot: unlabeled data points falling into natural groups]
Semi-supervised clustering: problem definition
• Input:
– A set of unlabeled objects
– A small amount of domain knowledge (labels or pairwise constraints)
• Output:
– A partitioning of the objects into k clusters
• Objective:
– Maximum intra-cluster similarity
– Minimum inter-cluster similarity
– High consistency between the partitioning and the domain knowledge
• Depending on the domain knowledge given:
– Users provide class labels (seeded points) a priori for some of the documents
– Users know which few documents are related (must-link) or unrelated (cannot-link)
Semi-Supervised Clustering
Seeded points
Must-link
Cannot-link
Why semi-supervised clustering?
• Large amounts of unlabeled data exist
– More is being produced all the time
• Generating labels for data is expensive
– Usually requires human intervention
• Use human input to provide labels for some of the data
– Improves existing naive clustering methods
– Labeled data guides the clustering of unlabeled data
– The end result is a better clustering of the data
• Potential applications
– Document/word categorization
– Image categorization
– Bioinformatics (gene/protein clustering)
Outline
• Introduction
• Overview of related work
• Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
• Theoretical result for SS-NMF
• Experiments and results
• Conclusion
Clustering Algorithms
• Document hierarchical clustering
– Bottom-up, agglomerative
– Top-down, divisive
• Document partitioning (flat clustering)
– K-means
– Probabilistic clustering using the Naïve Bayes or Gaussian mixture model, etc.
• Document clustering based on graph models
Semi-supervised Clustering Algorithms
• Semi-supervised clustering with labels (partial label information is given):
– SS-Seeded-KMeans (Sugato Basu et al., ICML 2002)
– SS-Constraint-KMeans (Sugato Basu et al., ICML 2002)
• Semi-supervised clustering with constraints (pairwise must-link and cannot-link constraints are given):
– SS-COP-KMeans (Wagstaff et al., ICML 2001)
– SS-HMRF-KMeans (Sugato Basu et al., ACM SIGKDD 2004)
– SS-Kernel-KMeans (Brian Kulis et al., ICML 2005)
– SS-Spectral-Normalized-Cuts (X. Ji et al., ACM SIGIR 2006)
Overview of K-means Clustering
• K-means is a partitional clustering algorithm based on iterative relocation that divides a dataset into k clusters.
• Objective function: Locally minimizes sum of squared distance between the data points and their corresponding cluster centers:
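In symbols (writing m_h for the center of cluster f_h):

$$J_{\mathrm{KM}} = \sum_{h=1}^{k} \sum_{x_i \in f_h} \lVert x_i - m_h \rVert^2$$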
Algorithm: initialize k cluster centers randomly, then repeat until convergence:
– Cluster Assignment Step: assign each data point x_i to the cluster f_h whose center is nearest
– Center Re-estimation Step: re-estimate each cluster center as the mean of the points currently assigned to it
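A minimal NumPy sketch of these two steps (illustrative only, not the code used in the experiments):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate assignment and center re-estimation."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Cluster Assignment Step: nearest center for each point
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Center Re-estimation Step: mean of the points in each cluster
        new = np.array([X[labels == h].mean(0) if (labels == h).any()
                        else centers[h] for h in range(k)])
        if np.allclose(new, centers):   # converged: centers stopped moving
            break
        centers = new
    return labels, centers
```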
Semi-supervised Kernel K-means (SS-KK) [Brian Kulis, et al. ICML 2005]
• Semi-supervised Kernel K-means objective:

$$J_{\text{SS-KK}} = \sum_{h=1}^{k} \sum_{x_i \in f_h} \lVert \phi(x_i) - m_h \rVert^2 \;-\; \sum_{(x_i, x_j) \in C_{ML},\, y_i = y_j} w_{ij} \;+\; \sum_{(x_i, x_j) \in C_{CL},\, y_i = y_j} \bar{w}_{ij}$$

where $\phi$ is the kernel mapping from input space to feature space, $m_h$ is the centroid of cluster $f_h$, and $w_{ij}$ ($\bar{w}_{ij}$) is the cost of violating the constraint between two points.
– First term: kernel k-means objective function
– Second term: reward function for satisfying must-link constraints
– Third term: penalty function for violating cannot-link constraints
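As an illustration, this objective can be evaluated for a fixed labeling using only the kernel matrix, via the standard identity that the in-cluster distortion equals the diagonal sum minus the mean of the within-cluster kernel block. The constraint lists and uniform weights below are assumptions for the sketch:

```python
import numpy as np

def ss_kk_objective(K, labels, must, cannot, w=1.0, wbar=1.0):
    """Evaluate the SS-KK objective for a fixed labeling:
    kernel k-means cost, minus rewards for satisfied must-links,
    plus penalties for violated cannot-links."""
    labels = np.asarray(labels)
    J = 0.0
    for h in np.unique(labels):
        idx = np.where(labels == h)[0]
        # sum_{i in h} ||phi(x_i) - m_h||^2
        #   = sum_{i in h} K_ii - (1/|h|) * sum_{j,l in h} K_jl
        J += K[idx, idx].sum() - K[np.ix_(idx, idx)].sum() / len(idx)
    J -= sum(w for i, j in must if labels[i] == labels[j])       # rewards
    J += sum(wbar for i, j in cannot if labels[i] == labels[j])  # penalties
    return J
```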
Overview of Spectral Clustering
• Spectral clustering is a graph-theoretic clustering algorithm on a weighted graph G = (V, E, A)
• Goal: minimize the between-cluster similarities (edge weights A_ij)
Spectral Normalized Cuts
• Minimize the similarity between clusters $\pi_p$ and $\pi_q$: $s(\pi_p, \pi_q) = \sum_{i \in \pi_p,\, j \in \pi_q} A_{ij}$
• Balance weights: $d_i = \sum_j A_{ij}$, $D = \mathrm{diag}(d_1, \dots, d_n)$
• Cluster indicator: $q_h = D^{1/2} \mathbf{1}_{\pi_h} / \lVert D^{1/2} \mathbf{1}_{\pi_h} \rVert$
• Graph partition becomes: $\min_q \sum_h q_h^T D^{-1/2} (D - A) D^{-1/2} q_h$
• Solution is given by the eigenvectors of $D^{-1/2} A D^{-1/2}$
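A minimal sketch of this relaxed solution, assuming a symmetric non-negative affinity matrix A; the rows of the returned embedding are then clustered (e.g., with k-means) to obtain the partition:

```python
import numpy as np

def spectral_embed(A, k):
    """Top-k eigenvectors of D^{-1/2} A D^{-1/2}: the relaxed
    normalized-cut cluster indicators."""
    d = A.sum(1)                                    # node degrees
    Dm = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = Dm @ A @ Dm                                 # normalized affinity
    vals, vecs = np.linalg.eigh(L)                  # ascending eigenvalues
    return vecs[:, -k:]                             # k largest -> embedding
```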
Semi-supervised Spectral Normalized Cuts (SS-SNC) [X. Ji, et al. ACM SIGIR 2006]
• Semi-supervised spectral learning objective: the normalized-cut criterion augmented with pairwise-constraint terms, where
– First term: spectral normalized cut objective function
– Second term: reward function for satisfying must-link constraints
– Third term: penalty function for violating cannot-link constraints
Outline
• Introduction
• Related work
• Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
– NMF review
– Model formulation and algorithm derivation
• Theoretical result for SS-NMF
• Experiments and results
• Conclusion
Non-negative Matrix Factorization (NMF)
• NMF decomposes a non-negative matrix into two non-negative factors (D. Lee et al., Nature 1999):

X ≈ F G^T,  min_{F ≥ 0, G ≥ 0} || X − F G^T ||²

• Symmetric NMF for clustering (C. Ding et al., SIAM SDM 2005) factorizes a document-document similarity matrix A:

A ≈ G S G^T,  min_{G ≥ 0, S ≥ 0} || A − G S G^T ||²

Example: a 5×5 similarity matrix A and its symmetric tri-factorization

A =
  0.3172 0.3148 0.2568 0.2640 0.2650
  0.3148 0.3244 0.2055 0.2090 0.2038
  0.2568 0.2055 0.7202 0.7411 0.8311
  0.2640 0.2090 0.7411 0.7822 0.8749
  0.2650 0.2038 0.8311 0.8749 1.0000

G =
  0.0348 0.5476
  0.0005 0.5355
  0.5256 0.3698
  0.5538 0.3765
  0.6449 0.3672

S =
  2.0402 0
  0      1.0735

A ≈ G × S × G^T
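The example can be checked numerically (values as above); the reconstruction error of G S G^T is on the order of 10^-2, and the largest entry in each row of G recovers the two clusters:

```python
import numpy as np

A = np.array([[0.3172, 0.3148, 0.2568, 0.2640, 0.2650],
              [0.3148, 0.3244, 0.2055, 0.2090, 0.2038],
              [0.2568, 0.2055, 0.7202, 0.7411, 0.8311],
              [0.2640, 0.2090, 0.7411, 0.7822, 0.8749],
              [0.2650, 0.2038, 0.8311, 0.8749, 1.0000]])
G = np.array([[0.0348, 0.5476],
              [0.0005, 0.5355],
              [0.5256, 0.3698],
              [0.5538, 0.3765],
              [0.6449, 0.3672]])
S = np.diag([2.0402, 1.0735])

print(np.abs(A - G @ S @ G.T).max())  # small residual (~0.02): A ~= G S G^T
# Rows of G are soft cluster memberships; argmax gives hard labels
print(G.argmax(1))                    # -> [1 1 0 0 0]: the two clusters
```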
SS-NMF
• Incorporate prior knowledge into the NMF-based framework for document clustering.
• Users provide pairwise constraints:
– Must-link constraints C_ML: two documents d_i and d_j must belong to the same cluster, (d_i, d_j) ∈ C_ML.
– Cannot-link constraints C_CL: two documents d_i and d_j must belong to different clusters, (d_i, d_j) ∈ C_CL.
• Constraints are defined by an associated violation cost matrix W:
– W_reward: reward for satisfying a must-link constraint between documents d_i and d_j, if such a constraint exists.
– W_penalty: cost of violating a cannot-link constraint between documents d_i and d_j, if such a constraint exists.
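A sketch of how the constraints might be folded into the similarity matrix before factorization; the uniform weights are illustrative, and the paper's exact weighting scheme may differ:

```python
import numpy as np

def constrain_similarity(A, must, cannot, w_reward=1.0, w_penalty=1.0):
    """Build A~ = A + W_reward - W_penalty from pairwise constraints."""
    At = A.copy()
    for i, j in must:        # must-link: raise the pairwise similarity
        At[i, j] += w_reward
        At[j, i] += w_reward
    for i, j in cannot:      # cannot-link: lower the pairwise similarity
        At[i, j] -= w_penalty
        At[j, i] -= w_penalty
    return At
```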
SS-NMF Algorithm
• Define the objective function of SS-NMF:

$$\min_{G \ge 0,\, S \ge 0} J_{\text{SS-NMF}} = \lVert \tilde{A} - G S G^T \rVert^2$$

where

$$\tilde{A} = A + W_{\text{reward}} - W_{\text{penalty}}$$
$$W_{\text{reward}} = \{\, w_{ij} \mid (d_i, d_j) \in C_{ML} \text{ s.t. } y_i = y_j \,\}$$
$$W_{\text{penalty}} = \{\, \bar{w}_{ij} \mid (d_i, d_j) \in C_{CL} \text{ s.t. } y_i = y_j \,\}$$

and $y_i$ is the cluster label of document $d_i$.
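A gradient-derived multiplicative-update sketch for this objective; the function name ss_nmf and these particular update rules follow the standard derivation for tri-factor NMF and are not necessarily the exact rules of the paper:

```python
import numpy as np

def ss_nmf(At, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate  min_{G,S>=0} ||A~ - G S G^T||^2  by alternating
    multiplicative updates. Assumes A~ has non-negative entries
    (e.g., modest constraint weights)."""
    rng = np.random.default_rng(seed)
    n = At.shape[0]
    G = np.abs(rng.standard_normal((n, k)))
    S = np.abs(rng.standard_normal((k, k)))
    for _ in range(n_iter):
        # S-update from the gradient -2 G^T A~ G + 2 G^T G S G^T G
        GtG = G.T @ G
        S *= (G.T @ At @ G) / (GtG @ S @ GtG + eps)
        # G-update from the gradient -4 A~ G S + 4 G S G^T G S
        GS = G @ S
        G *= (At @ GS) / (GS @ (G.T @ GS) + eps)
    labels = G.argmax(1)   # hard labels from the soft indicator G
    return G, S, labels
```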
Summary of SS-NMF Algorithm
Outline
• Introduction
• Overview of related work
• Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
• Theoretical result for SS-NMF
• Experiments and results
• Conclusion
Algorithm Correctness and Convergence
Based on constrained optimization theory and the auxiliary function method, we can prove for SS-NMF:
1. Correctness: the solution converges to a local minimum
2. Convergence: the iterative algorithm converges (details in papers [1], [2])
[1] Y. Chen, M. Rege, M. Dong and J. Hua, "Incorporating User Provided Constraints into Document Clustering", Proc. of IEEE ICDM, Omaha, NE, October 2007. (Regular Paper, acceptance rate 7.2%)
[2] Y. Chen, M. Rege, M. Dong and J. Hua, "Non-negative Matrix Factorization for Semi-supervised Data Clustering", Journal of Knowledge and Information Systems, to appear, 2008.
SS-NMF: General Framework for Semi-supervised Clustering
Starting from the SS-KK objective,

$$J_{\text{SS-KK}} = \sum_{h=1}^{k} \sum_{x_i \in X_h} \lVert \phi(x_i) - m_h \rVert^2 - \sum_{(d_i, d_j) \in C_{ML} \text{ s.t. } y_i = y_j} w_{ij} + \sum_{(d_i, d_j) \in C_{CL} \text{ s.t. } y_i = y_j} \bar{w}_{ij}$$

Proof: steps (1)-(3) rewrite the SS-KK, SS-SNC, and SS-NMF objectives in matrix trace form (see [1], [2] for the derivations).

Orthogonal symmetric semi-supervised NMF is equivalent to Semi-supervised Kernel K-means (SS-KK) and Semi-supervised Spectral Normalized Cuts (SS-SNC)!
Advantages of SS-NMF
Clustering indicator:
• SS-KK: hard clustering; exact orthogonal cluster indicator
• SS-SNC: the derived latent semantic space is constrained to be orthogonal; no direct relationship between the singular vectors and the clusters
• SS-NMF: soft clustering; maps the documents into a non-negative latent semantic space which may not be orthogonal; the cluster label can be determined by the axis with the largest projection value

Time complexity:
• SS-KK: iterative algorithm
• SS-SNC: requires solving a computationally expensive constrained eigen-decomposition
• SS-NMF: iterative algorithm that can yield a partial answer at intermediate stages by specifying a fixed number of iterations; simple basic matrix computations, easily deployed over a distributed computing environment when dealing with large document collections
Outline
• Introduction
• Overview of related work
• Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
• Theoretical result for SS-NMF
• Experiments and results
– Artificial toy data
– Real data
• Conclusion
Experiments on Toy Data
1. Artificial toy data: consisting of two natural clusters
Results on Toy Data (SS-KK and SS-NMF)
Table (right): difference between the cluster indicator G of SS-KK (hard clustering) and SS-NMF (soft clustering) on the toy data
• Hard clustering: each object belongs to a single cluster
• Soft clustering: each object is probabilistically assigned to clusters
Results on Toy Data (SS-SNC and SS-NMF)
(a) Data distribution in the SS-SNC subspace of the first two singular vectors. There is no relationship between the axes and the clusters.
(b) Data distribution in the SS-NMF subspace of the two column vectors of G. The data points from the two clusters are distributed along the two axes.
Time Complexity Analysis
Figure (above): computational speed comparison of SS-KK, SS-SNC, and SS-NMF (complexity O(n²kt) for n documents, k clusters, and t iterations)
Experiments on Text Data
2. Summary of data sets[1] used in the experiments.
[1]http://www.cs.umn.edu/~han/data/tmdata.tar.gz
• Evaluation metric: clustering accuracy

$$AC = \frac{\sum_{i=1}^{n} \delta(y_i, \hat{y}_i)}{n}$$

where n is the total number of documents in the experiment, δ is the delta function that equals one if $\hat{y}_i = y_i$ and zero otherwise, $\hat{y}_i$ is the estimated cluster label, and $y_i$ is the ground truth.
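A sketch of this metric; since cluster indices are arbitrary, the estimated labels are first matched to the ground-truth classes (here via the Hungarian method, one standard choice):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """AC = (1/n) * sum_i delta(y_i, map(y^_i)), maximized over
    one-to-one mappings between cluster indices and classes.
    Assumes integer labels starting at 0."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1                        # confusion counts
    rows, cols = linear_sum_assignment(-cost)  # best label mapping
    return cost[rows, cols].sum() / len(y_true)
```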
Results on Text Data (Comparison with Unsupervised Clustering)
• (1) Comparison with unsupervised clustering approaches:
Note: SS-NMF uses only 3% pairwise constraints
Results on Text Data (Before Clustering and After Clustering)
(a) Typical document-document matrix before clustering
(b) Document-document similarity matrix after clustering with SS-NMF (k=2)
(c) Document-document similarity matrix after clustering with SS-NMF (k=5)
Results on Text Data (Clustering with Different Constraints)
Table (left): comparison of the confusion matrix C and the normalized cluster centroid matrix S of SS-NMF for different percentages of document pairs constrained
Results on Text Data (Comparison with Semi-supervised Clustering)
• (2) Comparison with SS-KK and SS-SNC
(a) Graft-Phos (b) England-Heart (c) Interest-Trade
• Comparison with SS-KK and SS-SNC (Fbis2, Fbis3, Fbis4, Fbis5)
Results on Text Data (Comparison with Semi-supervised Clustering)
Experiments on Image Data
Figure (above): sample images for image categorization (from top to bottom: O-Owls, R-Roses, L-Lions, E-Elephants, H-Horses)
3. Image data sets[2] used in the experiments.
[2] http://kdd.ics.uci.edu/databases/CorelFeatures/CorelFeatures.data.html
Results on Image Data (Comparison with Unsupervised Clustering)
Table (above): comparison of image clustering accuracy among KK, SNC, NMF, and SS-NMF with only 3% of image pairs constrained. SS-NMF consistently outperforms these well-established unsupervised image clustering methods.
• (1) Comparison with unsupervised clustering approaches:
Results on Image Data (Comparison with Semi-supervised Clustering)
• (2) Comparison with SS-KK and SS-SNC:
Figure (left): comparison of image clustering accuracy among SS-KK, SS-SNC, and SS-NMF for different percentages of image pairs constrained: (a) O-R, (b) L-H, (c) R-L, (d) O-R-L
Results on Image Data (Comparison with Semi-supervised Clustering)
• (2) Comparison with SS-KK and SS-SNC:
Figure (left): comparison of image clustering accuracy among SS-KK, SS-SNC, and SS-NMF for different percentages of image pairs constrained: (e) L-E-H, (f) O-R-L-E, (g) O-L-E-H, (h) O-R-L-E-H
Outline
• Introduction
• Related work
• Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
• Theoretical result for SS-NMF
• Experiments and results
• Conclusion
Conclusion
• Semi-supervised clustering:
– has many real-world applications
– outperforms traditional clustering algorithms
• The semi-supervised NMF algorithm provides a unified mathematical framework for semi-supervised clustering.
• Many existing semi-supervised clustering algorithms can be extended to multi-type object co-clustering tasks.
References
[1] Y. Chen, M. Rege, M. Dong and F. Fotouhi, “Deriving Semantics for Image Clustering from Accumulated User Feedbacks”, Proc. of ACM Multimedia, Germany, 2007.
[2] Y. Chen, M. Rege, M. Dong and J. Hua, "Incorporating User Provided Constraints into Document Clustering", Proc. of IEEE ICDM, Omaha, NE, October 2007. (Regular Paper, acceptance rate 7.2%)
[3] Y. Chen, M. Rege, M. Dong and J. Hua, "Non-negative Matrix Factorization for Semi-supervised Data Clustering", Journal of Knowledge and Information Systems, invited as a best paper of ICDM 07, to appear, 2008.