Discovering and Describing Coherent and Meaningful Topics from a Text Collection

Henry Anaya-Sanchez

IR&NLP-UNED

May 7th, 2013

Content

1 Introduction

2 Discovering topics from term pairs

3 Methodology

4 Evaluation

5 Conclusions

Motivation

i. Information systems and their users need to analyze, structure, and summarize large collections of text documents according to the main subject themes that run through them (i.e., their topics).

ii. Traditional approaches to discovering and describing topics, based on clustering and Probabilistic Topic Modeling (PTM), do not always provide end users with coherent and meaningful topics.

Clustering and PTM are unsupervised learning techniques that have been widely used for topic discovery from documents.

i. Clustering methods aim at generating document groups or clusters, each one representing a different topic.

ii. PTM approaches represent the topics by learning a set of word distributions from which each document in the collection can be generated.

However, the obtained clusters/distributions do not necessarily correspond to actual topics of interest:

i. They do not always correlate with human judgements, so they cannot always provide end users with semantically coherent (interpretable, subject-based) and meaningful (main theme vs. background) topics that summarize the content of a text collection.

ii. They are actually clusters or probability distributions of words with a statistical meaning that is sometimes difficult for humans to interpret and explain, since in many cases the information they convey is not related to any subject.

On the other hand,

i. Clustering methods do not provide descriptions that summarize the clusters' contents (so that users can judge their relevance).

ii. The descriptions provided by PTM approaches are currently limited to listing the most probable (frequent) terms under each distribution.

This talk presents an approach to topic discovery that focuses on:

1 how to discover the semantically coherent and meaningful (interpretable, subject-heading-like) topics comprised in a text collection, and

2 how to simultaneously provide an appropriate description for each topic so that humans can easily judge its relevance.

A topic discovery methodology based on term pairs

The methodology is closely related to a series of works, namely FIHC (Fung et al., 2003), CFWS (Li et al., 2008), and the method proposed by Malik and Kender (2006), that aim to obtain both the coverage of a topic and its description simultaneously by means of a clustering criterion based on the concept of frequent term set (i.e., a set of terms that co-occur in at least a minimum number of documents in the text collection).

In these works, document clusters and their descriptions are determined by the frequent term sets of the document collection.
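
The notion of a frequent term set can be made concrete with a small sketch. The Python below is only an illustration restricted to term sets of size two; the function name `frequent_term_pairs`, the `min_support` parameter, and the toy documents are assumptions made for this example and are not taken from FIHC, CFWS, or this talk.

```python
from collections import Counter
from itertools import combinations


def frequent_term_pairs(documents, min_support):
    """Return term pairs that co-occur in at least `min_support` documents.

    `documents` is a list of token lists. Both names are hypothetical and
    used only for this illustration.
    """
    support = Counter()
    for tokens in documents:
        # Count each pair once per document, regardless of term frequency.
        for pair in combinations(sorted(set(tokens)), 2):
            support[pair] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}


# Toy collection, invented for the example.
docs = [
    ["oil", "price", "opec"],
    ["oil", "price", "market"],
    ["election", "vote", "market"],
]
pairs = frequent_term_pairs(docs, min_support=2)
```

Here only the pair ("oil", "price") reaches a support of two documents, so those two documents would form the support set that a frequent-term-set method could treat as a candidate cluster.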

Similarly, our approach relies on highly probable term pairs generated from the collection. However, we use these pairs only as a guide to explore the possible topics of the collection.

Topics and their descriptions are generated from term pairs deemed to be representative of a collection topic.

We introduce the concept of homogeneity of a document set, which is aimed at checking if a set of documents is cohesive enough to represent a topic.

We define the probability of generating a pair of terms $\{t_i, t_j\} \in P$ from $C$ as:

$$P(\{t_i, t_j\} \mid C) = \sum_{d \in C} P(t_i \mid d)\, P(t_j \mid d)\, P(d \mid C) \tag{1}$$
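
As a rough illustration of equation (1), the sketch below scores term pairs over a toy collection. The slides do not state how $P(t \mid d)$ and $P(d \mid C)$ are estimated, so the maximum-likelihood estimate $tf(t, d)/|d|$ and the uniform document prior used here are assumptions, as is the function name `pair_probabilities`.

```python
from collections import Counter
from itertools import combinations


def pair_probabilities(documents):
    """Score every co-occurring term pair with equation (1).

    Assumptions (not stated in the talk): P(d|C) is uniform and
    P(t|d) is the maximum-likelihood estimate tf(t, d) / |d|.
    """
    p_doc = 1.0 / len(documents)  # assumed uniform P(d|C)
    scores = Counter()
    for tokens in documents:
        counts = Counter(tokens)
        length = len(tokens)
        # Pairs that do not co-occur in d contribute zero under the MLE,
        # so it is enough to iterate over pairs present in the document.
        for ti, tj in combinations(sorted(counts), 2):
            scores[(ti, tj)] += (counts[ti] / length) * (counts[tj] / length) * p_doc
    return scores


# Toy collection, invented for the example.
docs = [
    ["oil", "oil", "price", "opec"],
    ["oil", "price"],
]
scores = pair_probabilities(docs)
```

With these estimates the pair ("oil", "price") receives 0.0625 from the first document and 0.125 from the second, for a total of 0.1875, which is the highest score among the pairs in this toy collection.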

We propose a novel way to estimate the homogeneity of a set of documents (specifically, for the support set of a given term pair) by analyzing its possible content coverage.

It relies on the concept of pure entropy of a partition $\Theta = \{\Theta_1, \ldots, \Theta_q\}$:

$$H(\Theta) = -\sum_{i=1}^{q} P(\Theta_i \mid \Theta) \log_2 P(\Theta_i \mid \Theta) \tag{2}$$

where $P(\Theta_i \mid \Theta)$ can be estimated as $|\Theta_i| / \sum_{j=1}^{q} |\Theta_j|$.
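
Equation (2) with the size-based estimate can be computed directly. The following is a minimal sketch; the function name `partition_entropy` and the example partitions are hypothetical, and how the partition of a support set is obtained in the first place is not covered here.

```python
import math


def partition_entropy(partition):
    """Entropy (in bits) of a partition, per equation (2).

    `partition` is a list of collections of document ids; P(Theta_i | Theta)
    is estimated as |Theta_i| / sum_j |Theta_j|.
    """
    total = sum(len(part) for part in partition)
    entropy = 0.0
    for part in partition:
        if part:  # empty parts contribute nothing (0 * log 0 is taken as 0)
            p = len(part) / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical partitions of four documents.
single_block = partition_entropy([{1, 2, 3, 4}])
two_halves = partition_entropy([{1, 2}, {3, 4}])
```

A partition with a single block has entropy 0, while two blocks of equal size give 1 bit, so lower values indicate that the documents concentrate in fewer blocks.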

Additionally, we provide larger descriptions based on the likelihood ratio score (Dunning, 1993).

This score has been widely used for estimating the correlation of terms with respect to a target topic (Lin and Hovy, 2000; Harabagiu and Lacatusu, 2005).
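
One common way to compute a likelihood ratio score for a term is the G² statistic over a 2×2 contingency table of topic versus background counts. The sketch below uses that formulation; whether the talk uses exactly this variant, and the names `log_likelihood_ratio`, `k1`, `n1`, `k2`, and `n2`, are assumptions made for illustration.

```python
import math


def log_likelihood_ratio(k1, n1, k2, n2):
    """G^2 statistic for a term in a topic versus a background set.

    k1: occurrences of the term in the topic documents (n1 tokens in total)
    k2: occurrences of the term in the background (n2 tokens in total)
    Parameter names are hypothetical and chosen for this example.
    """
    total = n1 + n2
    term_total = k1 + k2
    # Observed cells paired with their row and column marginals.
    cells = [
        (k1, n1, term_total),
        (n1 - k1, n1, total - term_total),
        (k2, n2, term_total),
        (n2 - k2, n2, total - term_total),
    ]
    g2 = 0.0
    for observed, row, col in cells:
        if observed > 0:  # a zero cell contributes nothing to the sum
            expected = row * col / total
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


# Invented counts: same relative frequency vs. strongly topic-specific.
balanced = log_likelihood_ratio(10, 100, 10, 100)
skewed = log_likelihood_ratio(30, 100, 5, 1000)
```

A term with the same relative frequency inside and outside the topic scores 0, and the score grows as the term becomes more specific to the topic, so ranking terms by this value yields candidates for extending a topic description.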

To evaluate this approach we used three benchmark collections: AFP, TDT2 version 4.0, and Reuters-21578. The Reuters dataset is composed of stories that have been tagged with the attribute TOPICS = YES and include a BODY part.

These collections differ in terms of number of topics, topic sizes, number of dimensions, and document distribution.

Conclusions

- A new methodology for discovering and describing the coherent and meaningful topics comprised in a text collection has been introduced.

- The proposed algorithm provides a novel parameter-less method for discovering the topics in the collection while attaching suitable descriptions to the discovered topics.

- The experiments carried out over the TDT2 English corpus, the AFP Spanish collection, and Reuters-21578 validate our proposal and show significant improvements over existing methods in terms of the standard macro- and micro-averaged F1 measures.
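
For readers unfamiliar with the two averages mentioned above, the sketch below shows the generic difference between macro- and micro-averaged F1 from per-topic counts. How discovered topics are matched to reference topics in the talk's experiments is not described in the slides, so the per-topic (TP, FP, FN) counts are assumed as given, and the function names and numbers are invented for illustration.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts):
    """Return (macro F1, micro F1) for a list of per-topic (tp, fp, fn)."""
    # Macro: average the per-topic F1 values, so every topic weighs the same.
    macro = sum(f1(*c) for c in counts) / len(counts)
    # Micro: pool the counts first, so large topics dominate.
    tp, fp, fn = (sum(col) for col in zip(*counts))
    return macro, f1(tp, fp, fn)


# Invented counts for one small and one large topic.
counts = [(1, 1, 0), (8, 0, 2)]
macro, micro = macro_micro_f1(counts)
```

Because the macro average treats the small topic and the large topic equally, while the micro average pools their counts, the two figures can diverge when topic sizes are unbalanced, which is why both are usually reported.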
