Information Retrieval
Lecture 7Introduction to Information Retrieval (Manning et al. 2007)
Chapter 17
For the MSc Computer Science Programme
Dell ZhangBirkbeck, University of London
Yahoo! Hierarchy
http://dir.yahoo.com/science
[Figure: a slice of the Yahoo! directory tree. The top-level categories agriculture, biology, physics, CS, and space branch into subcategories such as dairy, crops, agronomy, and forestry (agriculture); botany, evolution, and cell (biology); magnetism and relativity (physics); AI and HCI (CS); and craft and missions (space), along with leaf nodes like courses.]
Hierarchical Clustering
Build a tree-like hierarchical taxonomy (a dendrogram) from a set of unlabeled documents.
Divisive (top-down): Start with all documents belonging to the same cluster. Eventually each document forms a cluster on its own. This is achieved by recursive application of a flat partitional clustering algorithm, e.g., k-means with k=2 (bisecting k-means).
Agglomerative (bottom-up): Start with each document being a single cluster.
Eventually all documents belong to the same cluster.
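The divisive strategy can be sketched as follows. This is a minimal illustration on 1-D points; the point values and the farthest-pair seeding are assumptions made for the example, not from the slides:

```python
# Bisecting k-means sketch: repeatedly split the largest cluster with 2-means.
# 1-D points and farthest-pair seeding are illustrative assumptions.

def two_means(points, iters=10):
    """Split a list of 1-D points into two clusters with Lloyd's algorithm."""
    c1, c2 = min(points), max(points)          # seed with the two extremes
    for _ in range(iters):
        a = [p for p in points if abs(p - c1) <= abs(p - c2)]
        b = [p for p in points if abs(p - c1) > abs(p - c2)]
        if not a or not b:
            break
        c1, c2 = sum(a) / len(a), sum(b) / len(b)
    return a, b

def bisecting_kmeans(points, k):
    """Top-down: start with one cluster, split the largest until k remain."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()               # split the biggest cluster
        a, b = two_means(largest)
        clusters += [a, b]
    return clusters

print(bisecting_kmeans([1.0, 1.2, 5.0, 5.1, 9.0, 9.3], k=3))
```

Note that, unlike agglomerative clustering, this stops as soon as the requested number of clusters is reached.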
Dendrogram
Clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
The number of clusters k is not required in advance.
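For single-link clustering, cutting the dendrogram at similarity level t is equivalent to forming connected components over document pairs with similarity at least t. A small sketch, where the document names and similarity values are illustrative assumptions:

```python
# Cut a single-link dendrogram at threshold t: docs connected by any chain of
# pairwise similarities >= t end up in the same cluster (connected component).
# The documents and similarity values below are illustrative assumptions.

def cut_clusters(docs, sim, t):
    parent = {d: d for d in docs}              # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    for (a, b), s in sim.items():
        if s >= t:
            parent[find(a)] = find(b)          # union the two components

    groups = {}
    for d in docs:
        groups.setdefault(find(d), []).append(d)
    return list(groups.values())

docs = ["d1", "d2", "d3", "d4", "d5"]
sim = {("d1", "d2"): 0.9, ("d4", "d5"): 0.8, ("d3", "d4"): 0.6, ("d1", "d3"): 0.2}
print(cut_clusters(docs, sim, t=0.5))   # two clusters: {d1,d2} and {d3,d4,d5}
```

Raising t cuts the dendrogram lower and yields more, smaller clusters; lowering it yields fewer, larger ones.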
HAC
Hierarchical Agglomerative Clustering starts with each document in a separate cluster. Repeat until only one cluster remains:
Among the current clusters, determine the pair of clusters, ci and cj, that are most similar (Single-Link, Complete-Link, etc.).
Then merge ci and cj into a single cluster. The history of merging forms a binary tree, or hierarchy.
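The loop above can be sketched directly in a few lines. This is a naive O(n^3) version using single-link similarity; the small similarity table is an illustrative assumption:

```python
# Naive HAC: repeatedly merge the most similar pair of clusters until one
# cluster remains, recording the merge history (the dendrogram).
# The sim table values are illustrative assumptions.
from itertools import combinations

def hac(docs, sim):
    clusters = [frozenset([d]) for d in docs]  # start: one cluster per doc
    history = []
    while len(clusters) > 1:
        # single-link: cluster similarity = max pairwise member similarity
        def link(ci, cj):
            return max(sim.get(frozenset([x, y]), 0.0) for x in ci for y in cj)
        ci, cj = max(combinations(clusters, 2), key=lambda p: link(*p))
        clusters.remove(ci)
        clusters.remove(cj)
        clusters.append(ci | cj)               # record merge, keep the union
        history.append((set(ci), set(cj)))
    return history

docs = ["d1", "d2", "d3", "d4", "d5"]
sim = {frozenset(p): s for p, s in [
    (("d1", "d2"), 0.9), (("d4", "d5"), 0.8), (("d3", "d4"), 0.6),
    (("d1", "d3"), 0.2), (("d2", "d3"), 0.1)]}
for a, b in hac(docs, sim):
    print(a, "+", b)
```

With n documents there are exactly n-1 merges, which is why the history forms a binary tree.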
Single-Link
The similarity between a pair of clusters is defined by the single strongest link (i.e., the maximum cosine similarity) between their members:

sim(c_i, c_j) = max_{x ∈ c_i, y ∈ c_j} sim(x, y)

After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:

sim((c_i ∪ c_j), c_k) = max( sim(c_i, c_k), sim(c_j, c_k) )
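The update rule means single-link similarities can be maintained incrementally after each merge, without rescanning cluster members. A tiny check that the shortcut agrees with the brute-force definition, on a made-up member similarity table:

```python
# Verify the single-link update rule on explicit member similarities:
# sim(ci U cj, ck) == max(sim(ci, ck), sim(cj, ck)).
# The member similarity table is an illustrative assumption.
sim = {("x1", "z1"): 0.3, ("x2", "z1"): 0.1,   # ci = {x1, x2}
       ("y1", "z1"): 0.7}                       # cj = {y1}, ck = {z1}

ci, cj, ck = ["x1", "x2"], ["y1"], ["z1"]
link = lambda A, B: max(sim[(a, b)] for a in A for b in B)

# brute force over the merged cluster vs. the incremental update
assert link(ci + cj, ck) == max(link(ci, ck), link(cj, ck))
print(link(ci + cj, ck))  # 0.7
```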
HAC – Example
d1
d2
d3
d4
d5
d1,d2
d4,d5
d3
d3,d4,d5
As the clusters agglomerate, the documents fall into a hierarchy that can be drawn as a dendrogram.