Information Retrieval
Lecture 7Introduction to Information Retrieval (Manning et al. 2007)
Chapter 17
For the MSc Computer Science Programme
Dell ZhangBirkbeck, University of London
Yahoo! Hierarchy
http://dir.yahoo.com/science
[Figure: a slice of the Yahoo! directory tree. The top-level categories agriculture, biology, physics, CS, and space branch into subcategories such as dairy, crops, agronomy, and forestry (agriculture); botany, evolution, and cell (biology); magnetism and relativity (physics); AI and HCI (CS); and craft and missions (space), along with leaf nodes like courses.]
Hierarchical Clustering
Build a tree-like hierarchical taxonomy (a dendrogram) from a set of unlabeled documents.
Divisive (top-down): Start with all documents belonging to the same cluster. Eventually each document forms a cluster on its own. This is achieved by recursive application of a flat partitional clustering algorithm, e.g., k-means with k=2 (bisecting k-means).
Agglomerative (bottom-up): Start with each document being a single cluster.
Eventually all documents belong to the same cluster.
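The divisive strategy can be sketched as follows. This is a minimal illustration on 1-D points; the point values and the farthest-pair seeding are assumptions made for the example, not from the slides:

```python
# Bisecting k-means sketch: repeatedly split the largest cluster with 2-means.
# 1-D points and farthest-pair seeding are illustrative assumptions.

def two_means(points, iters=10):
    """Split a list of 1-D points into two clusters with Lloyd's algorithm."""
    c1, c2 = min(points), max(points)          # seed with the two extremes
    for _ in range(iters):
        a = [p for p in points if abs(p - c1) <= abs(p - c2)]
        b = [p for p in points if abs(p - c1) > abs(p - c2)]
        if not a or not b:
            break
        c1, c2 = sum(a) / len(a), sum(b) / len(b)
    return a, b

def bisecting_kmeans(points, k):
    """Top-down: start with one cluster, split the largest until k remain."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()               # split the biggest cluster
        a, b = two_means(largest)
        clusters += [a, b]
    return clusters

print(bisecting_kmeans([1.0, 1.2, 5.0, 5.1, 9.0, 9.3], k=3))
```

Note that, unlike agglomerative clustering, this stops as soon as the requested number of clusters is reached.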
Dendrogram
Clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
The number of clusters k is not required in advance.
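For single-link clustering, cutting the dendrogram at similarity level t is equivalent to forming connected components over document pairs with similarity at least t. A small sketch, where the document names and similarity values are illustrative assumptions:

```python
# Cut a single-link dendrogram at threshold t: docs connected by any chain of
# pairwise similarities >= t end up in the same cluster (connected component).
# The documents and similarity values below are illustrative assumptions.

def cut_clusters(docs, sim, t):
    parent = {d: d for d in docs}              # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    for (a, b), s in sim.items():
        if s >= t:
            parent[find(a)] = find(b)          # union the two components

    groups = {}
    for d in docs:
        groups.setdefault(find(d), []).append(d)
    return list(groups.values())

docs = ["d1", "d2", "d3", "d4", "d5"]
sim = {("d1", "d2"): 0.9, ("d4", "d5"): 0.8, ("d3", "d4"): 0.6, ("d1", "d3"): 0.2}
print(cut_clusters(docs, sim, t=0.5))   # two clusters: {d1,d2} and {d3,d4,d5}
```

Raising t cuts the dendrogram lower and yields more, smaller clusters; lowering it yields fewer, larger ones.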
HAC
Hierarchical Agglomerative Clustering starts with each document in a separate cluster. Repeat until only one cluster remains:
Among the current clusters, determine the pair of clusters, ci and cj, that are most similar (Single-Link, Complete-Link, etc.).
Then merge ci and cj into a single cluster. The history of merging forms a binary tree, or hierarchy.
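The loop above can be sketched directly in a few lines. This is a naive O(n^3) version using single-link similarity; the small similarity table is an illustrative assumption:

```python
# Naive HAC: repeatedly merge the most similar pair of clusters until one
# cluster remains, recording the merge history (the dendrogram).
# The sim table values are illustrative assumptions.
from itertools import combinations

def hac(docs, sim):
    clusters = [frozenset([d]) for d in docs]  # start: one cluster per doc
    history = []
    while len(clusters) > 1:
        # single-link: cluster similarity = max pairwise member similarity
        def link(ci, cj):
            return max(sim.get(frozenset([x, y]), 0.0) for x in ci for y in cj)
        ci, cj = max(combinations(clusters, 2), key=lambda p: link(*p))
        clusters.remove(ci)
        clusters.remove(cj)
        clusters.append(ci | cj)               # record merge, keep the union
        history.append((set(ci), set(cj)))
    return history

docs = ["d1", "d2", "d3", "d4", "d5"]
sim = {frozenset(p): s for p, s in [
    (("d1", "d2"), 0.9), (("d4", "d5"), 0.8), (("d3", "d4"), 0.6),
    (("d1", "d3"), 0.2), (("d2", "d3"), 0.1)]}
for a, b in hac(docs, sim):
    print(a, "+", b)
```

With n documents there are exactly n-1 merges, which is why the history forms a binary tree.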
Single-Link
The similarity between a pair of clusters is defined by the single strongest link (i.e., the maximum cosine similarity) between their members:

sim(c_i, c_j) = max_{x ∈ c_i, y ∈ c_j} sim(x, y)

After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:

sim((c_i ∪ c_j), c_k) = max( sim(c_i, c_k), sim(c_j, c_k) )
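The update rule means single-link similarities can be maintained incrementally after each merge, without rescanning cluster members. A tiny check that the shortcut agrees with the brute-force definition, on a made-up member similarity table:

```python
# Verify the single-link update rule on explicit member similarities:
# sim(ci U cj, ck) == max(sim(ci, ck), sim(cj, ck)).
# The member similarity table is an illustrative assumption.
sim = {("x1", "z1"): 0.3, ("x2", "z1"): 0.1,   # ci = {x1, x2}
       ("y1", "z1"): 0.7}                       # cj = {y1}, ck = {z1}

ci, cj, ck = ["x1", "x2"], ["y1"], ["z1"]
link = lambda A, B: max(sim[(a, b)] for a in A for b in B)

# brute force over the merged cluster vs. the incremental update
assert link(ci + cj, ck) == max(link(ci, ck), link(cj, ck))
print(link(ci + cj, ck))  # 0.7
```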
HAC – Example
d1
d2
d3
d4
d5
d1,d2
d4,d5
d3
d3,d4,d5
As the clusters agglomerate, the documents fall into a hierarchy that can be drawn as a dendrogram.