Discovering and Describing Coherent and Meaningful Topics from a Text Collection
Henry Anaya-Sanchez
Introduction
Discovering topics from term pairs
Methodology
Evaluation
Conclusions
Discovering and Describing Coherent and Meaningful Topics from a Text Collection
Henry Anaya-Sanchez
IR&NLP-UNED
May 7th, 2013
Content
1 Introduction
2 Discovering topics from term pairs
3 Methodology
4 Evaluation
5 Conclusions
Motivation
i. Information systems and their users need to analyze, structure, and summarize large collections of text documents according to the main subject themes that run through the collection (i.e., their topics).
ii. Traditional approaches to discovering and describing topics, based on clustering and Probabilistic Topic Modeling (PTM), do not always provide end users with coherent and meaningful topics.
Clustering and PTM are unsupervised learning techniques that have been widely used for topic discovery from documents.
i. Clustering methods aim at generating document groups or clusters, each one representing a different topic.
ii. PTM approaches focus on learning a set of word distributions, aimed at generating each document in a collection, to represent the topics.
However, the resulting clusters and distributions do not necessarily correspond to actual topics of interest:
i. They do not always correlate with human judgements, so they do not always give end users semantically coherent (interpretable, subject-based) and meaningful (main theme vs. background) topics that summarize the content of a text collection.
ii. They are in fact clusters or probability distributions of words with a statistical meaning, which are sometimes difficult for humans to interpret and explain, since in many cases the information they convey is not related to any subject.
On the other hand,
i. Clustering methods do not provide descriptions that summarize the clusters' contents (so that users can judge their relevance).
ii. The descriptions provided by PTM approaches are currently limited to listing the most probable (frequent) terms under each distribution.
This talk presents an approach to topic discovery that focuses on:
1 how to discover the semantically coherent and meaningful (interpretable, subject-heading-like) topics contained in a text collection, and
2 how to simultaneously provide an appropriate description for each topic so that humans can easily judge its relevance.
A topic discovery methodology based on term pairs
The methodology is closely related to a series of works, namely FIHC (Fung et al., 2003), CFWS (Li et al., 2008), and the method proposed by Malik and Kender (2006), which aim to obtain both the coverage of a topic and its description simultaneously by means of a clustering criterion based on the concept of a frequent term set (i.e., a set of terms that co-occur in at least a minimum number of documents in the text collection).
In these works, document clusters and their descriptions are determined by the frequent term sets of the document collection.
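The notion of a frequent term set can be illustrated for the simplest case, sets of two terms. The sketch below is only an illustration of the definition given above, not the procedure used by FIHC, CFWS, or the approach in this talk; the function name `frequent_term_pairs`, the `min_support` parameter, and the toy documents are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_term_pairs(docs, min_support):
    """Return {pair: support} for term pairs that co-occur in at least
    `min_support` documents. Each document is a list of terms."""
    support = Counter()
    for doc in docs:
        # Count each pair once per document, in a canonical (sorted) order.
        terms = sorted(set(doc))
        support.update(combinations(terms, 2))
    return {pair: n for pair, n in support.items() if n >= min_support}


# Toy collection (hypothetical data, for illustration only).
docs = [
    ["oil", "price", "opec"],
    ["oil", "price", "market"],
    ["election", "vote"],
]
pairs = frequent_term_pairs(docs, min_support=2)
```

Here only the pair ("oil", "price") reaches the support threshold, and the two documents containing it form its support set.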
Similarly, our approach relies on highly probable term pairs generated from the collection. However, we use these pairs only as a guide to explore the possible topics of the collection.
Topics and their descriptions are generated from term pairs deemed to be representative of a collection topic.
We introduce the concept of homogeneity of a document set, which is aimed at checking if a set of documents is cohesive enough to represent a topic.
We define the probability of generating a pair of terms $\{t_i, t_j\} \in P$ from $C$ as:
$$P(\{t_i, t_j\} \mid C) = \sum_{d \in C} P(t_i \mid d)\, P(t_j \mid d)\, P(d \mid C) \tag{1}$$
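Equation (1) can be evaluated directly once the three component probabilities are fixed. The slides do not say how they are estimated, so the sketch below assumes maximum-likelihood term probabilities, P(t|d) = tf(t, d) / |d|, and a uniform document prior, P(d|C) = 1 / |C|. Both choices, the function name `pair_probability`, and the toy documents are assumptions made for illustration.

```python
from collections import Counter


def pair_probability(ti, tj, docs):
    """Evaluate Equation (1) for the pair {ti, tj}.

    Assumptions (not from the slides): P(t|d) is the relative term
    frequency in d, and P(d|C) is uniform over the collection.
    Documents are non-empty lists of terms.
    """
    p_d = 1.0 / len(docs)
    total = 0.0
    for doc in docs:
        tf = Counter(doc)
        n = len(doc)
        total += (tf[ti] / n) * (tf[tj] / n) * p_d
    return total


# Toy collection (hypothetical data, for illustration only).
docs = [
    ["oil", "price", "oil", "opec"],
    ["oil", "price"],
    ["election", "vote"],
]
p = pair_probability("oil", "price", docs)
```

With these assumptions, a pair only receives probability mass from documents that contain both terms, so pairs that never co-occur score zero.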
We propose a novel way to estimate the homogeneity of a set of documents (specifically, for the support set of a given term pair) by analyzing its possible content coverage.
It relies on the concept of pure entropy of a partition $\Theta = \{\Theta_1, \ldots, \Theta_q\}$:
$$H(\Theta) = -\sum_{i=1}^{q} P(\Theta_i \mid \Theta) \log_2 P(\Theta_i \mid \Theta) \tag{2}$$
where $P(\Theta_i \mid \Theta)$ can be estimated as $|\Theta_i| / \sum_{j=1}^{q} |\Theta_j|$.
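Equation (2) with the size-based estimate of P(Θi|Θ) is straightforward to compute. The sketch below covers only the entropy itself; how the partition is built and how the entropy enters the homogeneity test are not spelled out in these slides, so neither is modeled here. The function name `partition_entropy` is hypothetical.

```python
import math


def partition_entropy(partition):
    """Entropy (in bits) of a partition given as a list of blocks,
    estimating P(block | partition) as |block| / sum of block sizes.
    Empty blocks are ignored, since they contribute nothing."""
    sizes = [len(block) for block in partition if len(block) > 0]
    total = sum(sizes)
    return -sum((s / total) * math.log2(s / total) for s in sizes)


# Two equally sized blocks give exactly one bit of entropy.
balanced = partition_entropy([{"d1", "d2"}, {"d3", "d4"}])
# A skewed partition has lower entropy than a balanced one.
skewed = partition_entropy([{"d1", "d2", "d3"}, {"d4"}])
```

The entropy is zero when all documents fall in a single block and grows as the documents spread more evenly over more blocks.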
Additionally, we provide larger descriptions based on the likelihood ratio score (Dunning, 1993).
This score has been widely used for estimating the correlation of terms with respect to a target topic (Lin and Hovy, 2000; Harabagiu and Lacatusu, 2005).
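For reference, the likelihood ratio score is commonly computed as the G² statistic over a 2×2 contingency table. The sketch below assumes document-level counts (documents inside or outside the topic, with or without the term); the slides do not state which counts the approach actually uses, so the table layout and the function name `log_likelihood_ratio` are assumptions.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table.

    Assumed layout (hypothetical):
      k11: topic documents containing the term
      k12: topic documents without the term
      k21: other documents containing the term
      k22: other documents without the term
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for k, r, c in cells:
        if k > 0:  # a zero cell contributes nothing (limit of k * log k)
            g2 += k * math.log(k * n / (rows[r] * cols[c]))
    return 2.0 * g2


independent = log_likelihood_ratio(10, 10, 10, 10)
associated = log_likelihood_ratio(10, 0, 0, 10)
```

A term distributed identically inside and outside the topic scores zero, while a term that occurs only in topic documents scores high, which is what makes the statistic usable for ranking candidate description terms.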
To evaluate this approach we used three benchmark collections: AFP, TDT2 version 4.0, and Reuters-21578. The Reuters dataset is composed of stories that have been tagged with the attribute TOPICS = YES and include a BODY part.
These collections differ in number of topics, topic sizes, number of dimensions, and document distribution.
Conclusions
- A new methodology for discovering and describing the coherent and meaningful topics contained in a text collection has been introduced.
- The proposed algorithm provides a novel parameter-free method for discovering the topics of the collection while attaching suitable descriptions to the discovered topics.
- The experiments carried out on the TDT2 English corpus, the AFP Spanish collection, and Reuters-21578 validate our proposal and show significant improvements over existing methods in terms of the standard macro- and micro-averaged F1 measures.
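As a reminder of how the two averages differ, the sketch below computes macro- and micro-averaged F1 from per-topic counts of true positives, false positives, and false negatives. How discovered topics are matched to reference topics in the experiments is not described in these slides, so the counts are taken as given; the function names and the numbers are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per reference topic.

    Macro-F1 averages the per-topic F1 values (every topic weighs the same);
    micro-F1 pools the counts first (every document weighs the same)."""
    macro = sum(f1(*c) for c in counts) / len(counts)
    tp, fp, fn = (sum(col) for col in zip(*counts))
    micro = f1(tp, fp, fn)
    return macro, micro


# Hypothetical counts for one large and one small topic.
counts = [(8, 2, 2), (1, 1, 3)]
macro, micro = macro_micro_f1(counts)
```

In this example the poorly recovered small topic pulls the macro average down more than the micro average, which is why both figures are usually reported together.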