Prototype Hierarchy Based Clustering for the Categorization and Navigation of Web Collections

Zhao-Yan Ming, Kai Wang and Tat-Seng Chua

School of Computing, National University of Singapore

SIGIR 2010

Speaker: Tom Chao Zhou

2010.10.26, Tuesday

2

Outline

•Motivation

•Prototype Hierarchy Based Clustering

•Problem Formulation and Approach

•Experiments

4

Motivation

•Utility of user-generated content

•Quality: distinguishing good content from bad.

•Accessibility:

• Question search.

• Organizing huge collections of data for information navigation: categorization, hierarchical clustering with labels and descriptions of clusters.

5

Categorization

•Users construct fine-grained topic hierarchies and assign objects to them

•Open Directory Project and Wikipedia

•Disadvantage: too much manual effort.

•Coarse-grained hierarchies

•Yahoo! Answers’ categories.

•Disadvantage: too coarse; for example, there is no “IPod” category.

6

Categorization

•Supervised techniques: not appropriate for dynamic Web services.

•Unsupervised techniques

•Cluster the collection into smaller groups.

•Extract labels for the clustered groups.

7

Prototype Hierarchy based Clustering (PHC)

•Tackles the web collection categorization and navigation problem.

•PHC utilizes world knowledge in the form of prototype hierarchies while adapting to the underlying topic structures of the collections.

8

Prototype Hierarchy based Clustering (PHC)

•Advantages

• Eliminates the problem of determining the number of clusters and assigning initial clusters, by following the structure of the prototype hierarchy.

• Results are interpretable, comprehensive, and organized.

• Flexible forms of supervision: the prototype hierarchy can come in different levels of granularity.

10

Prototype Hierarchy Based Clustering

•Prototype Hierarchy (PH)

•A hierarchy whose node set V represents a set of <l, p> tuples, where the prototype p serves as a description of the concept l.

•Data Hierarchy (DH)

•A hierarchy that organizes a collection of objects d. Each node represents a category of objects CO.
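The two definitions above can be pictured as a pair of tree types. The following is only a minimal sketch: the class names, the choice of a term-weight dictionary as the prototype, and the example labels and weights are all assumptions made for illustration, not details taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PrototypeNode:
    """One <l, p> tuple: a concept label l and its prototype p."""
    label: str
    # Assumption: the prototype is stored as a term -> weight mapping.
    prototype: Dict[str, float]
    children: List["PrototypeNode"] = field(default_factory=list)


@dataclass
class DataNode:
    """One DH node: a category of objects CO."""
    label: str
    objects: List[str] = field(default_factory=list)
    children: List["DataNode"] = field(default_factory=list)


def mirror(ph: PrototypeNode) -> DataNode:
    """Build an empty data hierarchy with the same structure as the PH."""
    return DataNode(ph.label, [], [mirror(child) for child in ph.children])


# Hypothetical two-level prototype hierarchy (labels and weights are made up).
ph = PrototypeNode("IPod", {"ipod": 1.0}, [
    PrototypeNode("battery", {"battery": 0.7, "charge": 0.3}),
    PrototypeNode("screen", {"screen": 0.8, "display": 0.2}),
])
dh = mirror(ph)
dh.children[0].objects.append("How long does the battery last?")
```

Starting the DH as a mirror of the PH reflects the idea that the clusters follow the predefined categories; how the DH then departs from that structure is a separate question.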

11

Problem Formulation

•Given a collection D of objects on a topic τ, PHC partitions D and maps it into the categories predefined by a PH on τ, such that the resulting object clusters CO1, CO2, ..., COk are organized in a DH with a similar structure.

12

•Some PH nodes have no objects.

•Some questions have no appropriate category to be assigned to.

13

Requirements

•The data hierarchy evolves into a compact structure that encodes the underlying topics of the collection.

•The data and prototype hierarchies match at both the node level and the relation level.

•Distances between objects are measured by appropriate metrics.

15

Problem Formulation and Approach

•Hierarchy Metric and Information Function

•A hierarchy metric is a function that operates on all nodes.

•h: V×V → R+, defined for an adjacent pair of nodes.

•The quality of the structure is measured by the amount of information carried in the hierarchy H.

16

Minimum Evolution

•Minimum Evolution (obj1)

•Intuition: the DH that compactly “encodes” the collection into topic categories is the best.

•Monitor the structural evolution of the data hierarchy.

•The optimal DH on a collection is the one that contains the least information.
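To make the "least information" criterion concrete, here is a small sketch. It assumes, purely for illustration, that the information carried by a hierarchy is the sum of the hierarchy metric h over its parent-child pairs; the slides do not give the exact formula, and the metric values below are made up.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str]  # (parent, child)


def information(edges: List[Edge], h: Callable[[str, str], float]) -> float:
    """Assumed information function: sum of h over adjacent node pairs."""
    return sum(h(parent, child) for parent, child in edges)


def least_information(candidates: Dict[str, List[Edge]],
                      h: Callable[[str, str], float]) -> str:
    """Return the name of the candidate hierarchy that carries the least information."""
    return min(candidates, key=lambda name: information(candidates[name], h))


# Toy metric values for adjacent pairs (made-up numbers).
toy_h = {("root", "a"): 1.0, ("root", "b"): 2.0, ("a", "b"): 0.5}


def h(u: str, v: str) -> float:
    return toy_h[(u, v)]


candidates = {
    "flat": [("root", "a"), ("root", "b")],   # information 1.0 + 2.0
    "nested": [("root", "a"), ("a", "b")],    # information 1.0 + 0.5
}
best = least_information(candidates, h)
```

Under these toy values the nested hierarchy is preferred because it carries less total information than the flat one.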

17

Matching of Prototype and Data Hierarchies

•Data Hierarchy Centroid

•Centroids of DH nodes are generated in an incremental manner.

•A new object in a leaf node automatically becomes a member of its ancestor nodes.

•The magnitude of the centroid change decreases with the number of levels above the leaf node.
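One way to picture the incremental centroid update is a running mean that is propagated from the leaf to the root. This is a sketch under that assumption (the slides do not state the update rule); the class and method names are hypothetical.

```python
from typing import List, Optional


class Node:
    """A DH node that keeps a running-mean centroid of its members."""

    def __init__(self, name: str, dim: int, parent: Optional["Node"] = None):
        self.name = name
        self.parent = parent
        self.count = 0
        self.centroid = [0.0] * dim

    def _absorb(self, x: List[float]) -> float:
        """Running-mean update; returns the L1 size of the centroid shift."""
        self.count += 1
        shift = 0.0
        for i, xi in enumerate(x):
            delta = (xi - self.centroid[i]) / self.count
            self.centroid[i] += delta
            shift += abs(delta)
        return shift

    def add(self, x: List[float]) -> List[float]:
        """Add x to this leaf and to every ancestor; return the shifts, leaf first."""
        shifts = []
        node: Optional["Node"] = self
        while node is not None:
            shifts.append(node._absorb(x))
            node = node.parent
        return shifts


root = Node("root", 2)
a = Node("a", 2, parent=root)
b = Node("b", 2, parent=root)
a.add([1.0, 1.0])
b.add([1.0, 1.0])
b.add([1.0, 1.0])
shifts = a.add([3.0, 1.0])  # shift at leaf "a", then at "root"
```

In this toy run the root already holds more members than the leaf, so the same new object moves the root centroid less than the leaf centroid. With a plain running mean this is the typical case rather than a guarantee.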

18

Prototype Centrality

•Prototype centrality (obj2)

•Intuition: add a data object to the node for which the updated centroids are most similar to their corresponding prototypes.

•A prototype is located at the center of an object cluster.
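The prototype centrality intuition can be sketched as a selection rule: try the object in each candidate node, recompute that node's centroid, and keep the node whose updated centroid is closest to its prototype. The sketch below is a simplification that scores only the candidate node itself (not its ancestors) and uses cosine similarity; both choices, and all names and vectors, are assumptions.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def updated_centroid(centroid: List[float], count: int, x: List[float]) -> List[float]:
    """Centroid of the node after x is added to its `count` existing members."""
    return [(c * count + xi) / (count + 1) for c, xi in zip(centroid, x)]


def best_node(x: List[float], nodes: Dict[str, dict]) -> str:
    """Pick the node whose updated centroid is most similar to its prototype."""
    def score(name: str) -> float:
        n = nodes[name]
        return cosine(updated_centroid(n["centroid"], n["count"], x), n["prototype"])
    return max(nodes, key=score)


# Hypothetical two-dimensional example.
nodes = {
    "battery": {"centroid": [1.0, 0.0], "count": 1, "prototype": [1.0, 0.0]},
    "screen": {"centroid": [0.0, 1.0], "count": 1, "prototype": [0.0, 1.0]},
}
x = [0.9, 0.1]
choice = best_node(x, nodes)
```

Here the object points almost entirely along the first dimension, so placing it in "battery" keeps that centroid close to its prototype, while placing it in "screen" would pull the centroid away from the "screen" prototype.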

19

Prototype-Data Hierarchy Resemblance

•Matching between two hierarchies H1, H2

•Full match, V1=V2 and R1=R2.

•Partial match

• Common hierarchy: matched nodes and relations.

• Incomplete match: V1+Vin=V2, R1+Rin=R2

• Excess match: V1=V2+Vin, R1=R2+Rin
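The match types above translate directly into set comparisons when each hierarchy is stored as a node set V and a relation set R of (parent, child) pairs. The following sketch assumes that representation; the function names are hypothetical.

```python
from typing import Set, Tuple

Relation = Tuple[str, str]  # (parent, child)


def match_type(v1: Set[str], r1: Set[Relation],
               v2: Set[str], r2: Set[Relation]) -> str:
    """Classify how hierarchy H1 = (V1, R1) matches H2 = (V2, R2)."""
    if v1 == v2 and r1 == r2:
        return "full"
    # Incomplete and excess matches are special cases of a partial match.
    if v1 <= v2 and r1 <= r2:
        return "incomplete"  # V1 + Vin = V2, R1 + Rin = R2
    if v1 >= v2 and r1 >= r2:
        return "excess"      # V1 = V2 + Vin, R1 = R2 + Rin
    return "partial"


def common_hierarchy(v1: Set[str], r1: Set[Relation],
                     v2: Set[str], r2: Set[Relation]):
    """Matched nodes and relations shared by both hierarchies."""
    return v1 & v2, r1 & r2


v_small = {"root", "a"}
r_small = {("root", "a")}
v_large = {"root", "a", "b"}
r_large = {("root", "a"), ("root", "b")}

kind = match_type(v_small, r_small, v_large, r_large)
common = common_hierarchy(v_small, r_small, v_large, r_large)
```

Swapping the two arguments turns the incomplete match into an excess match, which mirrors the symmetry of the two definitions.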

21

Prototype-Data Hierarchy Resemblance

•Prototype-Data Hierarchy Resemblance (obj3)

•Common part of the data hierarchy and the prototype hierarchy.

22

Partially Matched Prototype Hierarchy

•PH is an incomplete match of DH

•Add dummy child nodes to the existing nodes in the PH.

•Employ label extraction algorithms.

•PH is an excess match of DH

•Empty nodes are removed.
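The two adjustments above can be sketched as small tree operations. This is an illustration only: it assumes a node counts as empty when its whole subtree holds no objects, and it uses a placeholder label where a label extraction algorithm would supply a real one. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    objects: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of objects in the node's whole subtree."""
    return len(node.objects) + sum(size(child) for child in node.children)


def prune_empty(node: Node) -> None:
    """Excess match: drop child subtrees that hold no objects at all."""
    node.children = [child for child in node.children if size(child) > 0]
    for child in node.children:
        prune_empty(child)


def add_dummy_child(parent: Node, objects: List[str],
                    label: str = "<unlabeled>") -> Node:
    """Incomplete match: attach a placeholder child for objects that fit no node."""
    child = Node(label, list(objects))
    parent.children.append(child)
    return child


root = Node("root", [], [
    Node("battery", ["battery dies fast"]),
    Node("screen"),  # no objects were assigned here
])
prune_empty(root)
add_dummy_child(root, ["where to buy a case"])
labels = [child.label for child in root.children]
```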

23

Object Metric

•M(di, dj) is defined as the similarity between a pair of objects di and dj within a node.

•Translation-based Language Model: semantic similarity.

•Syntactic Tree Kernel Matching: syntactic similarity.

24

Category Cohesiveness

•Category Cohesiveness (obj4)

•Objects in the same category are similar to each other.

•Objects in different categories are dissimilar to each other.
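A simple way to turn the two statements above into a number is to subtract the mean similarity across categories from the mean similarity within categories. This is a sketch under that assumption, not the formula used by PHC, and the word-overlap similarity below is only a stand-in for the object metric M.

```python
from itertools import combinations
from typing import Callable, Dict, List


def jaccard(a: str, b: str) -> float:
    """Stand-in object similarity: word overlap between two texts."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def cohesiveness(categories: Dict[str, List[str]],
                 sim: Callable[[str, str], float] = jaccard) -> float:
    """Mean within-category similarity minus mean between-category similarity."""
    intra = [sim(a, b)
             for objs in categories.values()
             for a, b in combinations(objs, 2)]
    inter = [sim(a, b)
             for xs, ys in combinations(categories.values(), 2)
             for a in xs for b in ys]
    return mean(intra) - mean(inter)


# Made-up questions grouped well, then grouped badly.
good = {
    "battery": ["battery dies fast", "battery drains fast"],
    "screen": ["screen is cracked", "screen cracked badly"],
}
bad = {
    "x": ["battery dies fast", "screen is cracked"],
    "y": ["battery drains fast", "screen cracked badly"],
}
good_score = cohesiveness(good)
bad_score = cohesiveness(bad)
```

The well-grouped assignment scores higher than the mixed one, which is the behavior the objective is meant to reward.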

25

Multi-Criterion Optimization Function

•Minimum evolution.

•Prototype centrality.

•Prototype-Data Hierarchy Resemblance.

•Category cohesiveness.
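The slide lists the four criteria without saying how they are combined. One common option, shown here only as an assumption, is a weighted sum in which minimum evolution enters as a cost (less added information is better) and the other three as rewards. The weights and scores are made up.

```python
from typing import Dict

# Hypothetical weights; not taken from the slides.
WEIGHTS = {"obj1": 1.0, "obj2": 1.0, "obj3": 1.0, "obj4": 1.0}


def combined_score(obj: Dict[str, float],
                   w: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum: obj1 (added information) is a cost, obj2 to obj4 are rewards."""
    return (-w["obj1"] * obj["obj1"]
            + w["obj2"] * obj["obj2"]
            + w["obj3"] * obj["obj3"]
            + w["obj4"] * obj["obj4"])


# Made-up objective values for placing one object in two candidate nodes.
candidates = {
    "battery": {"obj1": 0.2, "obj2": 0.9, "obj3": 1.0, "obj4": 0.8},
    "screen": {"obj1": 0.6, "obj2": 0.4, "obj3": 1.0, "obj4": 0.3},
}
best = max(candidates, key=lambda name: combined_score(candidates[name]))
```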

27

Datasets

• Hierarchy

• Dental: Wikipedia

• IPod: manually constructed by combining Wikipedia articles, WordNet, and product specifications.

• Dataset diversity

• CS: deep hierarchy. Hierarchies are noisy.

• RS: broad hierarchy, abstract domain. Hierarchies are noisy.

• IPod: concrete domain.

• Dental: hierarchy is well constructed.

28

Experimental Setting

•proKmeans

•Prototype hierarchy enhanced K-means divisive hierarchical clustering.

•LiveClassifier

•PHC

•CFC Classifier

•Supervised text categorization technique.

29

•Given a prototype hierarchy for a collection, even a simple method can categorize the collection reasonably well.

•PHC is superior in terms of utilizing the prototype hierarchy.

•PHC is comparable with the supervised method.

•PHC introduces new nodes into the predefined hierarchy.

•PHC works better in concrete domains than in abstract domains.

30

Ablation Study on Optimization Objectives

• Prototype Centrality (obj2)

• Category Cohesiveness (obj4)

• Prototype-Data Hierarchy Resemblance (obj3)

• Minimum Evolution (obj1)

• Without minimum evolution, the data hierarchy varies less from the prototype hierarchy (new nodes are created when minimum evolution is used).

• The minimum evolution objective leads to a self-contained data hierarchy.

31

Robustness with Mismatched Prototype Hierarchy

•PHC is robust against overfitted prototype hierarchies.

•PHC has only limited ability to create categories.
