Improving Suffix Tree Clustering
• Base cluster ranking: s(B) = |B| * f(|P|)
  – |B| is the number of documents in base cluster B
  – |P| is the number of words in phrase P that have a non-zero score
  – Zero-score words: stopwords, and words that appear in too few (<3) or too many (>40%) documents
• Tf-Idf is better
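A minimal sketch of the base cluster ranking s(B) = |B| * f(|P|), using the zero-score rules from the slide. The shape of f (a penalty for single words, linear up to six words, flat after that) and the constant 0.5 are assumptions for illustration; the slide does not define f. All function and variable names are hypothetical.

```python
def phrase_length_factor(num_scored_words):
    """Assumed shape of f(|P|): penalise single words, grow linearly, cap at six."""
    if num_scored_words <= 0:
        return 0.0
    if num_scored_words == 1:
        return 0.5  # assumed penalty for single-word phrases
    return float(min(num_scored_words, 6))


def count_scored_words(phrase, doc_freq, num_docs, stopwords):
    """|P|: words that are not stopwords and occur in neither <3 nor >40% of documents."""
    scored = 0
    for word in phrase:
        df = doc_freq.get(word, 0)
        if word in stopwords or df < 3 or df > 0.4 * num_docs:
            continue  # zero-score word
        scored += 1
    return scored


def base_cluster_score(cluster_size, phrase, doc_freq, num_docs, stopwords):
    """s(B) = |B| * f(|P|)."""
    scored = count_scored_words(phrase, doc_freq, num_docs, stopwords)
    return cluster_size * phrase_length_factor(scored)
```

A Tf-Idf variant would replace the zero/non-zero word score with a weighted one; that is not shown here.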
Improving Suffix Tree clustering
• Cluster similarity
  – Page overlap
  – Add: cluster label distance (word pair distance)
    • Google normalised distance
    • WikiMiner: wikilink similarity
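As an illustration of a word pair distance, the sketch below computes the normalised Google distance from page counts. The counts are supplied by the caller; how they are obtained (search engine hits, a local index) is outside the slide and left as an assumption. The function name and its handling of zero counts are hypothetical choices.

```python
import math


def normalized_google_distance(count_x, count_y, count_xy, total_pages):
    """NGD(x, y) = (max(log fx, log fy) - log fxy) / (log N - min(log fx, log fy))."""
    if count_x <= 0 or count_y <= 0 or count_xy <= 0:
        return math.inf  # the words never co-occur: treat as maximally distant
    log_x, log_y, log_xy = math.log(count_x), math.log(count_y), math.log(count_xy)
    denominator = math.log(total_pages) - min(log_x, log_y)
    if denominator <= 0:
        return 0.0  # degenerate case: a word occurs on every page
    return (max(log_x, log_y) - log_xy) / denominator
```

Smaller values mean the two label words are more closely related, so two clusters with a small label distance are better candidates for merging.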
Improving suffix tree clustering
• 3rd step: cluster merging
  – If more than half of the pages overlap, then merge
  – New: HAC
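The original merging rule can be sketched as follows: two base clusters are linked when the shared pages make up more than half of each cluster, and linked clusters are merged as connected components. This covers only the overlap rule, not the HAC replacement. The mutual (both-directions) reading of "more than half" and all names are assumptions.

```python
def merge_base_clusters(base_clusters, threshold=0.5):
    """Merge base clusters whose page overlap exceeds `threshold` of both clusters."""
    names = list(base_clusters)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pages_a, pages_b = base_clusters[a], base_clusters[b]
            shared = len(pages_a & pages_b)
            if (pages_a and pages_b
                    and shared / len(pages_a) > threshold
                    and shared / len(pages_b) > threshold):
                parent[find(a)] = find(b)  # link the two components

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())
```

Because linking is transitive, a chain of pairwise overlaps can pull loosely related clusters into one group, which is the motivation for trying HAC instead.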
Query Directed Web Page Clustering
Daniel Crabtree, Peter Andreae, Xiaoying Gao
Victoria University of Wellington
Related Work: Web Page Clustering
• All Standard Algorithms
  – partitioning (k-means), hierarchical (agglomerative, divisive), …
• Web Features
  – structure, hyperlinks, colour
• Textual Features
  – STC: phrases, Lingo: latent semantic indexing
• Word Semantics
  – Global document analysis, co-occurrence statistics
• Query is never used
QDC – Query Directed Clustering
1: Find Base Clusters
2: Merge Clusters
3: Split Clusters
4: Select Clusters
5: Clean Clusters
QDC – 1: Find Base Clusters
• Clean Pages
• Identify Base Clusters
• Prune Small Clusters
• Semantic Prune #1
• Semantic Prune #2
[Figure: example base clusters with sizes: Mac (28), Car (40), Auto (25), Animal (18), OS (12), Atari (22), Game (5), Service (80), Forest (11). Labels: cluster size; distance(cluster, query); Score #1 = …; Score #2 = …]
QDC – 1: Query Distance
[Figure: query distance for the query "Jaguar". Labels: Car, Home Page, Toyota Specific, Broad, Ambiguous (twice)]
QDC – 1: Find Base Clusters
• Removes Many Base Clusters
  – Normally Negative Effect on Performance
But …
• Query Directed Score
  – Reliable Guide to Cluster Quality
  – Removes just Low Quality Clusters
  – Improves Performance
QDC – 2: Merge Clusters
• Merging
[Figure: merging example. Base clusters Mac (28), Car (40), Auto (25), Animal (18), OS (12), Atari (22); merged clusters Car, Auto (40) and Mac, OS (28)]
QDC – 2: Merge Clusters
• Single-link Clustering
• Similarity Function
  – Extension (by page overlap)
  – Intension (by description similarity)
    • Global document analysis: co-occurrence frequency relative to expected frequency if independent
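The description similarity measure, co-occurrence frequency relative to the frequency expected under independence, can be sketched as an observed/expected ratio over a document collection. Representing each document as a set of words, and the function name, are assumptions for illustration; how QDC turns this ratio into a merge decision is not given on the slide.

```python
def cooccurrence_ratio(word_a, word_b, documents):
    """Observed co-occurrence count divided by the count expected if the words were independent.

    `documents` is a list of sets of words. A ratio near 1 suggests independence,
    a ratio well above 1 suggests the words are semantically related.
    """
    total = len(documents)
    if total == 0:
        return 0.0
    count_a = sum(1 for doc in documents if word_a in doc)
    count_b = sum(1 for doc in documents if word_b in doc)
    count_ab = sum(1 for doc in documents if word_a in doc and word_b in doc)
    if count_a == 0 or count_b == 0:
        return 0.0
    expected = count_a * count_b / total  # expected co-occurrences under independence
    return count_ab / expected
```

With a measure like this, "Car" and "Auto" can merge on description similarity even when their page overlap alone would fall below the threshold.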
QDC – 2: Merge Clusters
• Reducing Page Overlap Threshold
  – Normally Negative Effect on Performance
But …
• Description Similarity
  – More semantically related clusters merge
    • Increasing cluster coverage
  – Fewer semantically unrelated clusters merge
    • Increasing cluster quality
QDC – 3: Split Clusters
• Single Link Merging
  – Cluster Chaining (Drifting)
• Hierarchical Agglomerative
  – Distance Measure: Path Length
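To illustrate the path length distance measure, the sketch below runs a breadth-first search over the graph of single-link merges inside one cluster. Base clusters at opposite ends of a long chain end up far apart, which is the signal a hierarchical agglomerative split could use. Only the distance computation is shown; the graph representation and names are assumptions.

```python
from collections import deque


def path_lengths(merge_graph, start):
    """Number of merge links on the shortest path from `start` to every reachable base cluster."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in merge_graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Hypothetical chained cluster: each base cluster was linked only to its neighbour.
chain = {
    "car": ["auto"],
    "auto": ["car", "vehicle"],
    "vehicle": ["auto", "service"],
    "service": ["vehicle"],
}
print(path_lengths(chain, "car"))
```

In this example "car" and "service" sit three links apart although single-link merging placed them in the same cluster.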
QDC – 4: Select Clusters
• ESTC cluster selection algorithm
  – Heuristic based hill-climbing search with look-ahead and advanced branch and bound pruning
• Original heuristic
  – Page Coverage and Cluster Overlap
• New heuristic
  – Page Coverage and Cluster Overlap
  – Pages Not Covered and Cluster Quality
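A much simplified sketch of the selection idea: a greedy hill climb that repeatedly adds the cluster with the best trade-off between newly covered pages and overlap with pages already covered. It omits the look-ahead, the branch and bound pruning, and the cluster quality term; the scoring formula, the `overlap_weight` parameter, and all names are assumptions, not the ESTC or QDC heuristic itself.

```python
def select_clusters(clusters, max_clusters, overlap_weight=1.0):
    """Greedily pick clusters that add coverage while penalising overlap."""
    selected, covered = [], set()
    remaining = dict(clusters)

    def gain(name):
        pages = remaining[name]
        # Reward pages not yet covered, penalise pages already covered.
        return len(pages - covered) - overlap_weight * len(pages & covered)

    while remaining and len(selected) < max_clusters:
        best = max(sorted(remaining), key=gain)  # sorted() makes ties deterministic
        if gain(best) <= 0:
            break  # no remaining cluster improves the selection
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered
```

A quality-aware variant would multiply or add a per-cluster quality score inside `gain`, in the spirit of the new heuristic's "Cluster Quality" term.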
QDC – 5: Clean Clusters
• Page-Cluster Relevance
  – Based on Base Cluster Membership
  – Cluster Size, Cluster Quality
• Remove Outliers and Erroneous Inclusions
• Sorting improves usability
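One way to picture the cleaning step: score each page by the fraction of the cluster's base clusters that contain it, drop pages below a cut-off, and sort the rest by relevance. The unweighted fraction, the 0.5 cut-off, and the names are assumptions; the slide indicates that cluster size and cluster quality also feed into the real relevance score, which is not modelled here.

```python
def clean_cluster(base_clusters, min_relevance=0.5):
    """Return (page, relevance) pairs, most relevant first, with outliers removed.

    `base_clusters` is a list of page sets, one per base cluster merged into this cluster.
    """
    if not base_clusters:
        return []
    pages = set().union(*base_clusters)
    relevance = {
        page: sum(1 for base in base_clusters if page in base) / len(base_clusters)
        for page in pages
    }
    kept = [(page, score) for page, score in relevance.items() if score >= min_relevance]
    # Highest relevance first; page id breaks ties so the order is stable.
    return sorted(kept, key=lambda item: (-item[1], item[0]))
```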
Evaluation
• Algorithm Efficiency on 250 Documents
  – Ten Times Faster than STC
  – One Hundred Times Faster than K-means
• Algorithm Performance
  – External Evaluation against a rich gold standard
• Real World Usability
  – Informal Usability Comparison with four algorithms
    • K-means, ESTC, Lingo, Vivisimo
Evaluation: Algorithm Performance
• External Evaluation against a rich gold standard
• Four Algorithms
  – STC, ESTC, K-means, Random
• Four Data Sets
  – Salsa, Jaguar, GP, Victoria University
• Eleven Measurements
  – Average and Weighted: Quality, Coverage, Precision, Recall, and Entropy + Mutual Information
• Snippets and Full Page Text
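For reference, three of the external measures can be sketched for a single cluster compared with gold-standard topics. These are the textbook definitions; the exact averaged and weighted variants used in the evaluation are not given on the slide. The sketch assumes each page has one gold topic, which is simpler than the rich gold standard described, and all names are hypothetical.

```python
import math
from collections import Counter


def precision_recall(cluster_pages, topic_pages):
    """Precision and recall of one cluster against one gold-standard topic."""
    hits = len(cluster_pages & topic_pages)
    precision = hits / len(cluster_pages) if cluster_pages else 0.0
    recall = hits / len(topic_pages) if topic_pages else 0.0
    return precision, recall


def cluster_entropy(cluster_pages, page_topic):
    """Entropy (in bits) of the gold topics inside a cluster; 0 means the cluster is pure."""
    counts = Counter(page_topic[page] for page in cluster_pages if page in page_topic)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```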
Evaluation: Quality and Coverage
Evaluation: Improvement over Random
Evaluation: Precision and Recall
Evaluation: Entropy and Mutual Information
Evaluation: Real World Usability
• QDC finds broader topics
  – Maximizes probability of refinement
  – Simplifies user’s decision process
    • Fewer choices
    • Less chance of multiple relevant choices
• Fewer semantically meaningless clusters
Jaguar Results
Evaluation: Real World Usability
• Performance better than indicated by external evaluation
  – No penalty for overly specific clusters since gold standard included them
• External evaluation shows QDC clusters have:
  – Fewer irrelevant pages
  – More relevant pages covered
Conclusion
• QDC: New Web Page Clustering Algorithm
• Key innovations:
  – Query Directed Scoring
  – Merging using cluster descriptions
  – Solve cluster chaining by splitting
  – Improved cluster selection heuristic
• Vastly improved performance over other algorithms
  – External evaluation
  – Informal usability evaluation
Further Extension
• Use Phrases rather than just Words
  – STC, Lingo show large improvement possible
• Use Wiki Link similarity (WikiMiner) instead of GND
• Future work:
  – Improve cluster description similarity merging to consider entire description
  – Common shared phrases as key features, use VSM, build vectors for each cluster, new weighting
  – Formal usability evaluation