The 5th annual UK Workshop on Computational Intelligence
London, 5-7 September 2005
Department of Electronic & Electrical Engineering
University College London, UK
Learning Topic Hierarchies from Text Documents using
a Scalable Hierarchical Fuzzy Clustering Method
E. Mendes Rodrigues and L. Sacks
{mmendes, lsacks}@ee.ucl.ac.uk
http://www.ee.ucl.ac.uk/~mmendes/
Outline
• Document clustering process
• H-FCM: Hyper-spherical Fuzzy C-Means
• H2-FCM: Hierarchical H-FCM
• Clustering experiments
• Topic hierarchies
Document Clustering Process
[Process diagram. Stage labels: Document Collection, Pre-processing, Document Encoding, Document Representation, Document Clustering (Document Similarity, Clustering Method), Document Clusters, Cluster Validity, Application]
Identify all unique words in the document collection
Discard common words that are included in the stop list
Apply stemming algorithm and combine identical word stems
Apply term weighting scheme to the final set of k indexing terms
Discard terms using pre-processing filters
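Document vectors:
\[
X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1k} \\
x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{Nk}
\end{bmatrix}
\]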
Vector-Space Model of Information Retrieval
Very high-dimensional
Very sparse (more than 95% of the entries are zero)
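The encoding steps above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the stop list, the sample documents, and the choice of tf-idf as the term weighting scheme are all assumptions (the slides do not name the scheme), and stemming and the pre-processing filters are left out for brevity.

```python
import math
import re
from collections import Counter

# Hypothetical stop list for illustration; real stop lists are much longer.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}

def tokenize(text):
    """Lower-case the text and keep alphabetic tokens not in the stop list."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def encode(documents):
    """Return (terms, X) where X[i][j] is the weight of term j in document i."""
    token_lists = [tokenize(d) for d in documents]
    terms = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    X = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Assumed weighting: term frequency times log inverse document frequency.
        X.append([tf[t] * math.log(n_docs / df[t]) for t in terms])
    return terms, X

def sparsity(X):
    """Fraction of zero entries in the document-term matrix."""
    total = sum(len(row) for row in X)
    zeros = sum(1 for row in X for value in row if value == 0)
    return zeros / total

# Hypothetical three-document collection.
docs = [
    "The fuzzy clustering of documents",
    "Fuzzy control in the engineering of systems",
    "Trade and shipping news",
]
terms, X = encode(docs)
print(len(docs), "documents,", len(terms), "terms, sparsity", round(sparsity(X), 2))
```

Even on this toy collection most entries of X are zero, because each document uses only a small part of the vocabulary.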
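Measures of Document Relationship
\[
S(x_A, x_B) = \frac{\sum_{j=1}^{k} x_{Aj}\, x_{Bj}}{\left( \sum_{j=1}^{k} x_{Aj}^{2} \, \sum_{j=1}^{k} x_{Bj}^{2} \right)^{1/2}}, \qquad
D(x_A, x_B) = 1 - S(x_A, x_B)
\]
\[
0 \le S(x_A, x_B) \le 1, \quad \forall A, B
\]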
• FCM applies the Euclidean distance, which is inappropriate for high-dimensional text clustering
it treats the absence of a term from both documents in much the same way as its presence in both
• Cosine (dis)similarity measure:
widely applied in Information Retrieval
represents the cosine of the angle between two document vectors
insensitive to different document lengths, since it is normalised by the length of the document vectors
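A small sketch of the two measures, using hypothetical term-count vectors, shows why length normalisation matters: the Euclidean distance places a document closer to an unrelated document of similar length than to a longer document with the same term proportions, while the cosine dissimilarity does not.

```python
import math

def cosine_similarity(xa, xb):
    """S(xa, xb): cosine of the angle between two document vectors."""
    dot = sum(a * b for a, b in zip(xa, xb))
    norm = math.sqrt(sum(a * a for a in xa) * sum(b * b for b in xb))
    return dot / norm if norm > 0 else 0.0

def cosine_dissimilarity(xa, xb):
    """D(xa, xb) = 1 - S(xa, xb)."""
    return 1.0 - cosine_similarity(xa, xb)

def euclidean(xa, xb):
    """Euclidean distance, shown only for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xa, xb)))

# Hypothetical term-count vectors over a four-term vocabulary.
short_doc = [1, 1, 0, 0]
long_doc = [3, 3, 0, 0]    # same term proportions, three times longer
other_doc = [0, 0, 1, 1]   # shares no terms with short_doc

print(cosine_dissimilarity(short_doc, long_doc), cosine_dissimilarity(short_doc, other_doc))
print(euclidean(short_doc, long_doc), euclidean(short_doc, other_doc))
```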
H-FCM: Hyper-spherical Fuzzy C-Means
• Applies the cosine measure to assess document relationships
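• Modified objective function:
\[
J_m(U, V) = \sum_{i=1}^{N} \sum_{\alpha=1}^{c} u_{\alpha i}^{m} D_{i\alpha}
          = \sum_{i=1}^{N} \sum_{\alpha=1}^{c} u_{\alpha i}^{m} \left( 1 - \sum_{j=1}^{k} x_{ij}\, v_{\alpha j} \right)
\]
• Subject to an additional constraint:
\[
D(v_\alpha, v_\alpha) = 1 - \sum_{j=1}^{k} v_{\alpha j}\, v_{\alpha j} = 1 - \sum_{j=1}^{k} v_{\alpha j}^{2} = 0, \quad \forall \alpha
\]
• Fuzzy memberships (u) and cluster centroids (v):
\[
u_{\alpha i} = \left[ \sum_{\beta=1}^{c} \left( \frac{D_{i\alpha}}{D_{i\beta}} \right)^{\frac{1}{m-1}} \right]^{-1}
\]
\[
v_\alpha = \sum_{i=1}^{N} u_{\alpha i}^{m}\, x_i \cdot \left[ \sum_{j=1}^{k} \left( \sum_{i=1}^{N} u_{\alpha i}^{m}\, x_{ij} \right)^{2} \right]^{-1/2}
\]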
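The membership and centroid updates can be sketched as an alternating loop. This is an illustrative reading of the update equations, not the authors' implementation: the random initialisation from documents, the fixed iteration count, the default m = 2, the small constant that avoids division by zero, and the normalisation of documents to unit length are all assumptions.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def hfcm(X, c, m=2.0, iterations=50, seed=0):
    """Sketch of H-FCM. Returns (U, V) with U[a][i] the membership of
    document i in cluster a and V[a] the unit-length centroid of cluster a."""
    rng = random.Random(seed)
    X = [normalize(x) for x in X]           # assumption: unit-length documents
    n, k = len(X), len(X[0])
    # Assumption: initial centroids are c distinct randomly chosen documents.
    V = [list(X[i]) for i in rng.sample(range(n), c)]
    U = []
    eps = 1e-12                             # keeps dissimilarities strictly positive
    for _ in range(iterations):
        # Dissimilarity D[a][i] = 1 - sum_j x_ij * v_aj.
        D = [[max(1.0 - sum(xj * vj for xj, vj in zip(x, v)), eps) for x in X]
             for v in V]
        # Membership update: u_ai = 1 / sum_b (D_ia / D_ib)^(1 / (m - 1)).
        U = [[1.0 / sum((D[a][i] / D[b][i]) ** (1.0 / (m - 1.0)) for b in range(c))
              for i in range(n)]
             for a in range(c)]
        # Centroid update: membership-weighted sum of documents, rescaled to unit length.
        V = [normalize([sum(U[a][i] ** m * X[i][j] for i in range(n)) for j in range(k)])
             for a in range(c)]
    return U, V

# Hypothetical four-document collection over four terms.
docs = [[1, 1, 0, 0], [2, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 2]]
U, V = hfcm(docs, c=2)
for i in range(len(docs)):
    print(i, [round(U[a][i], 3) for a in range(2)])
```

Rescaling each centroid to unit length is what enforces the constraint D(v, v) = 0, and the memberships of every document sum to one by construction. Like FCM, the loop can settle in a poor local optimum depending on the initial centroids.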
How many clusters?
• Usually the final number of clusters is not known a priori
Run the algorithm for a range of c values
Apply validity measures and determine which c leads to the best partition (cluster compactness, density, separation, etc.)
• How compact and dense are clusters in a sparse high-dimensional problem space?
Only a very small percentage of the documents within a cluster are highly similar to the respective centroid, so clusters are not compact
However, there is always a clear separation between intra- and inter-cluster similarity distributions
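One way to inspect that separation is to split the document-centroid similarities into an intra-cluster list and an inter-cluster list and compare them. The sketch below assumes unit-length documents and centroids and a hard assignment of each document to one cluster (for example, its highest membership); the data are hypothetical and the slides do not specify how the distributions were computed.

```python
def dot(a, b):
    """Dot product, equal to the cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def similarity_profile(X, V, labels):
    """Split document-centroid similarities into intra- and inter-cluster lists.

    labels[i] is the index of the cluster that document i is assigned to."""
    intra, inter = [], []
    for x, label in zip(X, labels):
        for a, v in enumerate(V):
            (intra if a == label else inter).append(dot(x, v))
    return intra, inter

# Hypothetical unit-length documents, centroids and hard assignments.
X = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
V = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 0, 1, 1]

intra, inter = similarity_profile(X, V, labels)
print("lowest intra-cluster similarity:", min(intra))
print("highest inter-cluster similarity:", max(inter))
```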
H2-FCM: Hierarchical Hyper-spherical Fuzzy C-Means
• Key concepts
Apply partitional algorithm (H-FCM) to obtain a sufficiently large number of clusters
Exploit the granularity of the topics associated with each cluster to link cluster centroids hierarchically
Form a topic hierarchy
• Asymmetric similarity measure
Identify parent-child type relationships between cluster centroids
A child should be less similar to its parent than the parent is to the child
\[
S(v_\alpha, v_\beta) = \frac{\sum_{j=1}^{k} \min(v_{\alpha j}, v_{\beta j})}{\sum_{j=1}^{k} v_{\alpha j}}
\]
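A short sketch makes the asymmetry concrete. It assumes the denominator sums the weights of the first argument, which is one reading of the garbled slide formula, and the two centroid vectors are hypothetical.

```python
def asymmetric_similarity(va, vb):
    """S(va, vb) = sum_j min(va_j, vb_j) / sum_j va_j.

    Assumption: the normalising sum runs over the first argument."""
    total = sum(va)
    if total <= 0:
        return 0.0
    return sum(min(a, b) for a, b in zip(va, vb)) / total

# Hypothetical centroids: one spread over four terms, one concentrated on two.
broad = [0.5, 0.5, 0.5, 0.5]
narrow = [0.0, 0.0, 0.7, 0.7]

print(asymmetric_similarity(broad, narrow))   # normalised by the broad centroid
print(asymmetric_similarity(narrow, broad))   # normalised by the narrow centroid
```

The numerator is the same in both directions, so the two values differ only through the normalising sum. That difference is what lets the measure order a pair of centroids.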
[Figure: documents and cluster centroids for clusters C1 to C12, with the annotation S(v8,v5) < tPCS]
The H2-FCM Algorithm
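[Figure: worked example of linking centroids v1 to v12 with the asymmetric similarity, using the centroid sets VF and VH. Annotations: S(v1,v5) ≥ tPCS and S(v8,v1) < tPCS]
[Flowchart: Apply H-FCM(c, m). If not all clusters have size ≥ tND, set c = c - K and apply H-FCM again. Otherwise compute the asymmetric similarity S between every pair of centroids. While VF ≠ ∅: select the centroid in VF with the maximum S. If VH = ∅, add it as a root. Otherwise select a parent; if S ≥ tPCS, add the centroid as a child, else add it as a root.]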
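The linking loop can be sketched as follows. This is only one possible reading of the flowchart: how the next centroid is selected, which direction of S is compared with the threshold, and how a parent is picked from the hierarchy are assumptions, and the outer loop that re-runs H-FCM with c = c - K is omitted. The names `link_centroids`, `free`, `hierarchy` and `t_pcs` are hypothetical stand-ins for the slide's VF, VH and tPCS.

```python
def asymmetric_similarity(va, vb):
    """S(va, vb) = sum_j min(va_j, vb_j) / sum_j va_j (assumed normalisation)."""
    total = sum(va)
    return sum(min(a, b) for a, b in zip(va, vb)) / total if total > 0 else 0.0

def link_centroids(V, t_pcs):
    """Return a dict mapping each centroid index to its parent index (None for roots)."""
    c = len(V)
    S = [[asymmetric_similarity(V[a], V[b]) for b in range(c)] for a in range(c)]
    free = set(range(c))      # centroids not yet placed (VF)
    hierarchy = []            # centroids already placed (VH)
    parent = {}
    while free:
        # Assumption: take the free centroid with the highest similarity
        # to any other free centroid (lowest index wins ties).
        a = max(sorted(free),
                key=lambda i: max((S[i][j] for j in free if j != i), default=0.0))
        free.remove(a)
        if not hierarchy:
            parent[a] = None                      # first centroid becomes a root
        else:
            # Assumption: candidate parent is the placed centroid most similar to a.
            p = max(hierarchy, key=lambda h: S[a][h])
            parent[a] = p if S[a][p] >= t_pcs else None
        hierarchy.append(a)
    return parent

# Hypothetical centroids over a four-term vocabulary.
centroids = [[0.5, 0.5, 0.5, 0.5], [0.0, 0.0, 0.7, 0.7], [0.9, 0.4, 0.0, 0.0]]
print(link_centroids(centroids, t_pcs=0.4))
```

Raising the threshold produces more roots and a flatter hierarchy, while lowering it merges more centroids under existing branches.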
Scalability of the Algorithm
• H2-FCM time complexity depends on H-FCM and centroid linking heuristic
• H-FCM computation time is O(Nc²k)
• Linking heuristic is at most O(c²k)
Computation of the asymmetric similarity between every pair of cluster centroids: O(c²k)
Generation of the cluster hierarchy: O(c²)
• Overall, H2-FCM time complexity is O(Nc²k)
• Scales well to large document sets!
Description of Experiments
• Goal: evaluate the H2-FCM performance
• Evaluation measures: clustering Precision (P) and Recall (R)
\[
P = \frac{tp}{tp + fp}, \qquad R = \frac{tp}{tp + fn}
\]
                         In reference class      Not in reference class
Assigned to cluster      true positives (tp)     false positives (fp)
Not assigned to cluster  false negatives (fn)    true negatives (tn)
• H2-FCM algorithm run for a range of c values
• Number of hierarchy roots = number of reference classes, so tPCS is set dynamically
• Are sub-clusters of the same topic assigned to the same branch?
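The two measures follow directly from the contingency table. A minimal sketch with hypothetical document identifiers:

```python
def precision_recall(cluster_docs, class_docs):
    """Precision and recall of one cluster against one reference class."""
    cluster_docs, class_docs = set(cluster_docs), set(class_docs)
    tp = len(cluster_docs & class_docs)   # assigned to cluster and in class
    fp = len(cluster_docs - class_docs)   # assigned to cluster, not in class
    fn = len(class_docs - cluster_docs)   # in class, not assigned to cluster
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical example: the cluster holds documents 1-4, the class holds 2-6.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(p, r)
```

Here tp = 3, fp = 1 and fn = 2, so P = 3/4 and R = 3/5.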
Test Document Collections
Reuters-21578 test collection: http://www.daviddlewis.com/resources/testcollections/reuters21578/
Open Directory Project (ODP): http://dmoz.org/
INSPEC database: http://www.iee.org/publish/inspec/

Collection  N      k      Classes  Doc. length (avg / stdev)  Doc. sparsity (avg / stdev)  Class labels
reuters1    1708   15744  3        73.45 / 63.97              99.67 % / 0.26 %             acq, earn, trade
reuters2    1374   11778  5        102.65 / 86.37             99.39 % / 0.47 %             crude, interest, money-fx, ship, trade
odp         556    620    5        15.14 / 5.07               97.69 % / 0.50 %             game, lego, math, safety, sport
inspec      7473   11803  3        93.28 / 32.79              99.59 % / 0.14 %             back-propagation, fuzzy control, pattern clustering
Clustering Results: H2-FCM Precision and Recall
[Figure: precision and recall plots, one panel per collection: odp, inspec, reuters1, reuters2]
Topic Hierarchy
• Each centroid vector consists of a set of weighted terms
• Terms describe the topics associated with the document cluster
• Centroid hierarchy produces a topic hierarchy
Useful for efficient access to individual documents
Provides context to users in exploratory information access
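As a sketch of how a centroid hierarchy could be turned into a readable topic hierarchy, each centroid can be labelled with its highest-weighted terms and printed under its parent. The vocabulary, centroid weights, parent links and the number of terms per label are all hypothetical.

```python
def top_terms(centroid, terms, n=2):
    """Return up to n terms with the highest positive weight in the centroid."""
    ranked = sorted(zip(terms, centroid), key=lambda tw: tw[1], reverse=True)
    return [term for term, weight in ranked[:n] if weight > 0]

def render(parent, labels):
    """Return indented lines, one per centroid, with children under their parent."""
    children = {}
    for node, p in parent.items():
        children.setdefault(p, []).append(node)
    lines = []

    def walk(node, depth):
        lines.append("  " * depth + ", ".join(labels[node]))
        for child in sorted(children.get(node, [])):
            walk(child, depth + 1)

    for root in sorted(children.get(None, [])):
        walk(root, 0)
    return lines

# Hypothetical vocabulary, centroids and parent links (None marks a root).
terms = ["control", "fuzzy", "logic", "trade"]
centroids = {0: [0.1, 0.9, 0.4, 0.0], 1: [0.8, 0.6, 0.0, 0.0], 2: [0.0, 0.0, 0.0, 1.0]}
parent = {0: None, 1: 0, 2: None}

labels = {node: top_terms(v, terms) for node, v in centroids.items()}
for line in render(parent, labels):
    print(line)
```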
Concluding Remarks
• H2-FCM clustering algorithm
Partitional clustering (H-FCM)
Linking heuristic organizes centroids hierarchically based on asymmetric similarity
• Scales linearly with the number of documents
• Exhibits good clustering performance
• Topic hierarchy can be extracted from the centroid hierarchy