A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy

Date: 2014/04/15
Source: KDD'13
Authors: Chi Wang, Marina Danilevsky, Nihit Desai, Yinan Zhang, Phuong Nguyen, Thrivikrama Taula, Jiawei Han
Advisor: Prof. Jia-Ling Koh
Speaker: Shun-Chen Cheng
Outline
- Introduction
- CATHY Framework
- Experiment
- Conclusions
Introduction
Ex: the term "Jaguar" may belong to several different topics: car, animal, sport, music.

A high-quality hierarchical organization of the concepts in a dataset at different levels of granularity has many valuable applications, such as search, summarization, and content browsing.
Introduction
A document is content-representative if it may serve as a concise description of its accompanying full document (e.g., a paper title).

Goal: recursively construct a hierarchy of topics from a collection of content-representative documents.
Introduction
Challenges:
1. Topical phrases that would be regarded as high quality by human users are likely to vary in length.
   Ex: "support vector machines" vs. "feature selection"
2. Globally frequent phrases are not assured to be good representations for individual topics.
CATHY Framework
Step 1: Construct term co-occurrence network.
Step 2: Produce subtopic subnetworks.
Step 3: Extract candidate phrases.
Step 4: Rank topical phrases.
Step 5: Recursively apply Steps 2–5 to each subtopic to construct the hierarchy.
Outline
- Introduction
- CATHY Framework
- Experiment
- Conclusions
Problem formulation
Definition 1 (Phrase): a phrase is an unordered set of n terms,
P = {w_x1, ..., w_xn | w_xi ∈ W},
where W is the set of all unique terms in the content-representative document collection.
Ex: "mining frequent patterns" = "frequent pattern mining"

Definition 2 (Topical Hierarchy): a tree of topics, e.g. a root t0 with children t1, t2, t3, t4:
- par(t4) = t0 (t0 is the parent of t4)
- C_t0 = {t1, t2, t3, t4} (the children of t0)
- P_t4 = {machine translation, statistical machine translation, word sense disambiguation, named entity}
- r_t(P_t4): the ranking score for the phrases in topic t4
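Definitions 1 and 2 can be mirrored in a small data structure. A minimal Python sketch (the `Topic` class and its field names are our own illustration, not from the paper); representing a phrase as a `frozenset` captures the unordered-set definition:

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    """One node t of the topical hierarchy (Definition 2)."""
    name: str
    parent: "Topic | None" = None                 # par(t); None for the root t0
    children: list = field(default_factory=list)  # C_t
    phrases: dict = field(default_factory=dict)   # P_t: phrase -> ranking score

# Definition 1: a phrase is an *unordered* set of terms, so (after stemming)
# "mining frequent pattern" and "frequent pattern mining" are the same phrase.
p1 = frozenset("mining frequent pattern".split())
p2 = frozenset("frequent pattern mining".split())
assert p1 == p2

t0 = Topic("root")
t4 = Topic("t4", parent=t0)
t0.children.append(t4)
t4.phrases[frozenset("machine translation".split())] = 1.0
```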
Problem formulation
Definition 3 (Topical Frequency): f_t(P) is the count of the number of times the phrase P is attributed to topic t.

*We use phrases as the basic units for constructing a topical hierarchy.
Construct term co-occurrence network
The root node o is associated with the term co-occurrence network G_o constructed from the collection of content-representative documents. G_o consists of a set of nodes W and a set of links E.

Ex: words in content-representative documents = {A, B, C, D, E}
- e_BC: a co-occurrence of the two terms (B, C) in a document.
- If the number of links e_DE is 3, then 3 documents contain both D and E.
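The construction above can be sketched directly: every unordered pair of distinct terms in a document adds one link, so the weight of link e_ij ends up being the number of documents containing both w_i and w_j. A minimal sketch (the toy corpus of single-letter "documents" is illustrative):

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_network(docs):
    """Return Counter {(wi, wj): e_ij}: the number of documents
    in which the two terms co-occur (links of the network G_o)."""
    links = Counter()
    for doc in docs:
        terms = sorted(set(doc.split()))       # unique terms of this document
        for wi, wj in combinations(terms, 2):  # every unordered pair
            links[(wi, wj)] += 1
    return links

# Toy collection over the terms {A, B, C, D, E}:
docs = ["A D E", "B D E", "C D E", "B C"]
net = build_cooccurrence_network(docs)
# D and E co-occur in 3 documents, so e_DE = 3.
```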
Clustering-EM algorithm

Total link frequency between w_i and w_j: e_ij = Σ_{z=1..k} e_ij^z, where e_ij^z is the estimated number of topic-z links, for k child topics (e.g. k = 3: topic1, topic2, topic3).

For every non-root node t ≠ o, we construct a subnetwork G_t by clustering the term co-occurrence network G_par(t) of its parent par(t). G_t has all of the nodes from G_par(t). Ex: G_t1 is clustered into subnetworks G_t11, G_t12, G_t13.
Clustering-EM algorithm
Each potential link is generated as a Bernoulli trial: the probability of generating a topic-z link between w_i and w_j is p(w_i|z) p(w_j|z). If we repeat the generation of topic-z links for ρ_z iterations, then when ρ_z is large, the number of topic-z links between w_i and w_j approximately follows a Poisson distribution with mean ρ_z p(w_i|z) p(w_j|z).
Clustering-EM algorithm
E-step: given the current parameters, the expected number of topic-z links between w_i and w_j is

ê_ij^z = e_ij · ρ_z p(w_i|z) p(w_j|z) / Σ_{z'=1..k} ρ_z' p(w_i|z') p(w_j|z')

M-step: re-estimate the parameters from the expected counts:

p(w_i|z) ∝ Σ_j ê_ij^z,  ρ_z = Σ_{i,j} ê_ij^z
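The E-step/M-step above can be sketched end to end. This is a simplified illustration, not the paper's implementation: plain Python, random initialization, a fixed number of iterations, and illustrative variable names (`theta[z][w]` for p(w|z), `rho[z]` for ρ_z):

```python
import random

def em_link_clustering(links, terms, k, iters=200, seed=0):
    """EM for clustering co-occurrence links into k topics.

    links: dict {(wi, wj): e_ij} of total link frequencies.
    Returns (theta, rho): theta[z][w] ~ p(w|z), rho[z] = expected
    total number of topic-z links.
    """
    rng = random.Random(seed)
    theta = [{w: rng.random() for w in terms} for _ in range(k)]
    for t in theta:                      # normalize each p(.|z)
        s = sum(t.values())
        for w in t:
            t[w] /= s
    rho = [1.0] * k

    for _ in range(iters):
        # E-step: split each observed count e_ij into expected
        # topic-z counts e_ij^z, proportional to rho_z p(wi|z) p(wj|z).
        expected = [dict() for _ in range(k)]
        for (wi, wj), e in links.items():
            weights = [rho[z] * theta[z][wi] * theta[z][wj] for z in range(k)]
            total = sum(weights) or 1.0
            for z in range(k):
                expected[z][(wi, wj)] = e * weights[z] / total
        # M-step: re-estimate rho_z and p(w|z) from the expected counts.
        for z in range(k):
            rho[z] = sum(expected[z].values())
            deg = {w: 0.0 for w in terms}
            for (wi, wj), ez in expected[z].items():
                deg[wi] += ez
                deg[wj] += ez
            s = sum(deg.values()) or 1.0
            theta[z] = {w: deg[w] / s for w in terms}
    return theta, rho
```

Because the E-step splits every e_ij across the k topics, the ρ_z values always sum to the total link frequency of the network; on a network with well-separated clusters the topics typically converge to those clusters.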
Estimating Topical Frequency
EM-Algorithm: according to the learned model parameters, we can estimate the topical frequency for a phrase P = {w_x1, ..., w_xn}: the total frequency of P is distributed among the k subtopics z in proportion to ρ_z Π_{i=1..n} p(w_xi|z), then normalized so that the topical frequencies sum to the total frequency.
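With the learned parameters in hand, the normalization described above is a short computation per phrase. A sketch (function and argument names are our own; `theta[z][w]` stands for p(w|z) and `rho[z]` for ρ_z):

```python
def topical_frequency(phrase_terms, total_freq, theta, rho):
    """Distribute a phrase's total frequency f(P) over the k subtopics
    in proportion to rho_z * prod_i p(w_xi | z), then normalize so the
    topical frequencies f_z(P) sum back to f(P)."""
    weights = []
    for z in range(len(theta)):
        w = rho[z]
        for term in phrase_terms:
            w *= theta[z].get(term, 0.0)
        weights.append(w)
    total = sum(weights)
    if total == 0.0:                    # phrase unseen under every topic
        return [0.0] * len(theta)
    return [total_freq * w / total for w in weights]

# Ex: a phrase whose terms only topic 0 generates gets the full frequency.
theta = [{"machine": 0.5, "translation": 0.5}, {"query": 1.0}]
rho = [100.0, 100.0]
f = topical_frequency(["machine", "translation"], 10, theta, rho)
# f == [10.0, 0.0]
```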
Topical Phrase Extraction
The goal is to extract patterns with topical frequency larger than some threshold minsup for every topic z:
1. Mine all frequent patterns (e.g. with Apriori or FP-growth).
2. Filter them by the topical frequency estimation.

Problem: every subset of a frequent phrase is also a frequent phrase.
- Completeness: if a phrase is not complete, it is a subset of a longer phrase, in the sense that it rarely occurs in a document without the presence of the longer phrase.
  Ex: "support vector machines" vs. "vector machines"; some of these subphrases should be removed.
Topical Phrase Extraction
Way to remove subphrases: for each topic z, we remove a phrase P if there exists a frequent phrase P' ⊃ P such that f_z(P') ≥ γ · f_z(P). The remaining patterns are referred to as γ-maximal patterns (0 ≦ γ ≦ 1).

| Phrase | Support = f_z(P) |
|--------|------------------|
| A      | 4 |
| B      | 4 |
| C      | 3 |
| AB     | 3 |
| AC     | 2 |
| BC     | 3 |
| ABC    | 2 |

- γ = 1: remove P if f_z(P') ≧ f_z(P). If P = C: P' = BC ⊃ C and f_z(BC) ≧ f_z(C), so remove C. The 1-maximal patterns are the closed patterns.
- γ = 0: remove P if f_z(P') ≧ 0; remaining pattern = ABC. The 0-maximal patterns are the maximal patterns.
- γ = 0.5: remove P if f_z(P') ≧ 0.5 · f_z(P); remaining pattern = ABC (the 0.5-maximal patterns).

P_z: the set of all γ-maximal phrases of topic z.
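The filtering rule above is direct to implement: keep P only if no frequent strict superphrase P' reaches γ · f_z(P). A sketch that reproduces the table's example (phrase supports copied from the table):

```python
def gamma_maximal(freqs, gamma):
    """Keep a phrase P only if no frequent superphrase P' has
    f_z(P') >= gamma * f_z(P).  freqs: {frozenset(terms): support}."""
    kept = {}
    for p, fp in freqs.items():
        dominated = any(
            p < q and fq >= gamma * fp      # p < q: strict superset check
            for q, fq in freqs.items()
        )
        if not dominated:
            kept[p] = fp
    return kept

# frozenset("AB") == {"A", "B"}: single-letter "terms" as in the table.
freqs = {
    frozenset("A"): 4, frozenset("B"): 4, frozenset("C"): 3,
    frozenset("AB"): 3, frozenset("AC"): 2, frozenset("BC"): 3,
    frozenset("ABC"): 2,
}
```

With γ = 1 this drops C (because f_z(BC) ≥ f_z(C)) and AC, i.e. it keeps the closed patterns; with γ = 0 or γ = 0.5 only ABC remains, matching the slide.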
Ranking Criteria
4 criteria for judging the quality of phrases:

- Coverage: a good phrase covers many documents within that topic.
  Ex: in the IR topic, "information retrieval" is better than "cross-language information retrieval".
- Purity: the phrase is only frequent in documents belonging to that topic and not frequent in documents within other topics.
  Ex: in the Databases topic, "query processing" is better than "query".
- Phraseness: the terms of the phrase co-occur significantly more often than the expected chance co-occurrence frequency.
  Ex: in the Machine Learning topic, "active learning" is better than "learning classification".
- Completeness: the phrase is not a subset of a longer phrase that almost always accompanies it.
  Ex: "support vector machines" is better than "vector machines".
Phrase Ranking: for each topical phrase
- Coverage: p(P|z) = f_z(P) / m_z, where m_z is the number of documents that contain at least one frequent topic-z phrase.
- Phraseness: compares p(P|z) with the independence baseline p_indep(P|z) = Π_{i=1..n} p(w_xi|z), where p(w_xi|z) = f_z(w_xi) / m_z; a good phrase has p(P|z) much larger than p_indep(P|z).
- Purity: compares the probability of seeing P in topic-z documents with the probability of seeing P across all topics.
- Ranking function: combines coverage, phraseness, and purity (the phraseness/purity trade-off is controlled by the weight ω) into a single score for every phrase in P_z.
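The coverage and phraseness quantities can be computed directly from the topical frequencies; the exact form of the combined ranking function is not recoverable from this slide, so the sketch below only illustrates the two probabilities it defines (all frequencies and `m_z` are made-up inputs):

```python
import math

def coverage(fz_P, m_z):
    """p(P|z) = f_z(P) / m_z: the fraction of topic-z documents
    (those containing a frequent topic-z phrase) covered by P."""
    return fz_P / m_z

def phraseness(fz_P, term_fzs, m_z):
    """Log-ratio of p(P|z) to the independence baseline
    p_indep(P|z) = prod_i p(w_xi|z), with p(w_xi|z) = f_z(w_xi) / m_z.
    Large and positive when the terms form a genuine phrase."""
    p = fz_P / m_z
    p_indep = math.prod(f / m_z for f in term_fzs)
    return math.log(p / p_indep)

# Ex: a 3-term phrase seen 50 times among m_z = 1000 topic-z documents,
# whose terms individually appear 80, 120, and 60 times.
s = phraseness(50, [80, 120, 60], 1000)   # > 0: phrase-like
```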
Outline
- Introduction
- CATHY Framework
- Experiment
- Conclusions
Experiment
Datasets:
- DBLP: a set of titles of recently published computer science papers in the areas related to DB, DM, IR, ML, and NLP; 33,313 titles consisting of 18,598 unique terms.
- Library: titles of books in 6 general categories; 33,372 titles consisting of 3,556 unique terms.

Methods for comparison:
- SpecClus
- hPAM
- hPAMrr
- CATHYcp: only considers the coverage and purity criteria (γ = 1, ω = 0)
- CATHY: minsup = 5, γ = ω = 0.5
Experiment: subtree of the output rooted at Level 2
Topic: Information Retrieval (one of the nodes in Level 2; on the DBLP dataset).
Experiment: User Study
On the DBLP dataset.
Experiment: predicting a book's category
On the Library dataset.
Outline

- Introduction
- CATHY Framework
- Experiment
- Conclusions
Conclusions
- Designed a novel phrase mining framework to cluster, extract, and rank phrases, which recursively discovers specific topics from more general ones, thus constructing a top-down topical hierarchy.
- Shifted from a unigram-centric to a phrase-centric view.
- Validated the framework's ability to generate high-quality, human-interpretable topic hierarchies.