DiversiWeb 2011 08: Mining Diverse Views from Related Articles - Ravali Pochampally
![Page 1: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/1.jpg)
Ravali Pochampally
Kamal Karlapalem
[IIIT Hyderabad]
![Page 2: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/2.jpg)
• WWW- diverse content
- 100+ articles on major topics
• Google News/Amazon- organized
- (yet) too much text
![Page 3: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/3.jpg)
• Condenses information
• Lacks Organization
• salient points
• length ∝ (1/content)
• user-specified parameters
× delineation of issues
× model diversity
× too long (?)
![Page 4: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/4.jpg)
• A view intends to represent an issue pertaining to a set of related¹ articles
- organized (multiple concise views)
- information exploration
- detailed snapshot
• Example²: review dataset (hotel)
- views [positive, negative, food, facilities]
- summary [unorganized]
1. articles concerning a common topic (FIFA 2010, swine flu in India etc.)
2. http://sites.google.com/site/diverseviews/comparison
![Page 5: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/5.jpg)
![Page 6: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/6.jpg)
![Page 7: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/7.jpg)
• Related Work
• Problem
• Extraction of views
• Ranking
• Results
• Discussion
![Page 8: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/8.jpg)
• Allison et al.
- idea of multiple view-points
- framework
• Tombros et al.
- clustering of top-ranking sentences
• TextTiling¹
- divide text into multi-paragraph units
- unit represents a sub-topic
1. M. A. Hearst 1997
![Page 9: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/9.jpg)
[Figure: Related Articles → Problem (Mining Diverse Views) → Ranked set of views]
![Page 10: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/10.jpg)
* http://sites.google.com/site/diverseviews/datasets
![Page 11: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/11.jpg)
• Datasets
- sources: Google News, amazon.com, tripadvisor.com
- crawling + parsing [HTML and RSS]
• Data cleaning & pre-processing
- stopwords, stemming and duplicates
- word-frequency, TF-IDF¹
1. http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html
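A minimal sketch of the TF-IDF weighting step, assuming the articles have already been cleaned and tokenized. The weight is term frequency times log(N/df), as in the reference cited on this slide; the function name `tf_idf`, the natural logarithm, and the toy documents are illustrative choices, not taken from the talk.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency inside this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy, already pre-processed documents (hypothetical data).
docs = [
    ["hotel", "food", "great"],
    ["hotel", "room", "small"],
    ["hotel", "food", "food", "cold"],
]
weights = tf_idf(docs)
```

A term that occurs in every document ("hotel" here) gets weight 0, which is the usual behaviour of this form of idf.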
![Page 12: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/12.jpg)
• Main idea
- We score each sentence in our dataset and extract the top-ranked ones. These sentences are used to generate views [Pruning]
• We assign an Importance (I) score to each sentence
• Importance $I_k$ of a sentence $S_k$ belonging to article $d_j$ of length $r$ is
$$I_k = \frac{\prod_{i=1}^{r} T_{i,j}}{r}$$
$T_{i,j}$ = TF-IDF of $w_i \in (S_k \wedge d_j)$
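The scoring and pruning step could be sketched as follows. This reads the formula as the product of the TF-IDF weights of the words in the sentence, divided by r, and assumes r is the number of words in the sentence. The helper names `importance` and `top_n`, the weight of 0 for unseen words, and the sample weights are all hypothetical.

```python
import math


def importance(sentence_tokens, tfidf_weights):
    """Product of the TF-IDF weights of the sentence's words, divided by r."""
    r = len(sentence_tokens)  # assumption: r is the sentence length in words
    if r == 0:
        return 0.0
    # Words missing from the weight table count as 0 (an assumption).
    product = math.prod(tfidf_weights.get(w, 0.0) for w in sentence_tokens)
    return product / r


def top_n(sentences, tfidf_weights, n):
    """Keep the n highest-scoring sentences (the pruning step)."""
    ranked = sorted(sentences, key=lambda s: importance(s, tfidf_weights), reverse=True)
    return ranked[:n]


# Hypothetical TF-IDF weights for one article.
article_weights = {"food": 0.8, "cold": 1.1, "room": 0.5, "small": 0.4}
sentences = [["food", "cold"], ["room", "small"], ["food", "room"]]
best = top_n(sentences, article_weights, 2)
```

Because the score is a product, a single zero-weight word makes the whole sentence score 0 under this reading.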
![Page 13: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/13.jpg)
• A measure of similarity is required to extract views from sentences
- Semantic similarity: likeness of meaning
• Mihalcea et al.
- specificity of a word can be determined by its idf
- we use word-to-word similarity & specificity to calculate semantic similarity
![Page 14: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/14.jpg)
• Semantic similarity between sentences $S_i$ and $S_j$, where w represents a word in a sentence, is a symmetric relation with range ∈ [0,1]
• Need to define maxSim(w, $S_j$)
- WordNet: sets of cognitive synonyms (synsets)
- wup¹: based on path length between synsets
1. Z. Wu 1994
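A sketch of a sentence similarity of this kind, assuming the standard form of the Mihalcea et al. measure: each direction is the idf-weighted average of maxSim over the words of one sentence, and the two directions are averaged. This may differ in detail from the formula shown on the slide. The word-level measure is passed in as `word_sim`; the exact-match stand-in used here is only for illustration, and a WordNet-based wup score (for example NLTK's `Synset.wup_similarity`) would take its place in practice.

```python
def max_sim(word, sentence, word_sim):
    """Highest word-to-word similarity between `word` and any word of `sentence`."""
    return max((word_sim(word, other) for other in sentence), default=0.0)


def directed_sim(src, dst, idf, word_sim):
    """idf-weighted average of maxSim(w, dst) over the words w of src."""
    total_idf = sum(idf.get(w, 0.0) for w in src)
    if total_idf == 0:
        return 0.0
    weighted = sum(max_sim(w, dst, word_sim) * idf.get(w, 0.0) for w in src)
    return weighted / total_idf


def sentence_similarity(s_i, s_j, idf, word_sim):
    """Average of both directions, so the result is symmetric and lies in [0, 1]."""
    return 0.5 * (directed_sim(s_i, s_j, idf, word_sim) + directed_sim(s_j, s_i, idf, word_sim))


def exact_match(a, b):
    # Stand-in for a WordNet-based word similarity such as wup (assumption).
    return 1.0 if a == b else 0.0


idf = {"hotel": 1.0, "food": 2.0, "room": 1.0}  # hypothetical idf values
s1 = ["hotel", "food"]
s2 = ["food", "room"]
score = sentence_similarity(s1, s2, idf, exact_match)
```

The result stays in [0, 1] as long as `word_sim` itself returns values in [0, 1] and the idf values are non-negative.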
![Page 15: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/15.jpg)
• Clustering is used to extract views from the set of important sentences
• Hierarchical Agglomerative Clustering (HAC) was used
- upper triangle [symmetric matrix]
- no restrictions on # of clusters
- terminate clustering when scoring parameter converges
• We treat clusters as views discussing similar content
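A rough sketch of agglomerative clustering over a sentence-similarity matrix. Two choices here are assumptions, not taken from the talk: clusters are compared by average linkage, and merging stops when the best available merge falls below a fixed `threshold`, which stands in for the convergence test on the scoring parameter. Only the upper triangle of the matrix is read, since the similarity is symmetric.

```python
def hac(sim, threshold):
    """sim: n x n similarity matrix; only entries with i < j are read.
    Returns clusters as lists of sentence indices."""
    clusters = [[i] for i in range(len(sim))]

    def link(a, b):
        # Average linkage: mean similarity over all cross-cluster pairs.
        pairs = [(min(i, j), max(i, j)) for i in a for j in b]
        return sum(sim[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best, x, y = max(
            (link(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if best < threshold:  # stand-in stopping rule (assumption)
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical similarities between four top-ranked sentences.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.1],
    [0.0, 0.1, 0.1, 1.0],
]
views = hac(sim, threshold=0.5)
```

The number of clusters is not fixed in advance; it follows from where the stopping rule halts the merging.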
![Page 16: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/16.jpg)
• Focus on average pair-wise similarity between the sentences in a view
• Cohesion
$$C = \frac{\sum_{i,j \in V} S_{i,j}}{\mathrm{len}(V)}, \qquad S_{i,j} = \mathrm{sim}(T_i, T_j)$$
V = set of sentences ($T_i$) in the view
len(V) = # of sentences in the view
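A small sketch of the cohesion score, assuming the sum runs over unordered pairs of distinct sentences in the view, so that a single-sentence view scores 0. The Jaccard overlap used as `sim` is only a stand-in for the semantic similarity; the names and sample sentences are hypothetical.

```python
from itertools import combinations


def cohesion(view, sim):
    """Sum of pair-wise similarities in the view, divided by the number of sentences."""
    if len(view) < 2:
        return 0.0  # a single sentence has no pairs
    total = sum(sim(a, b) for a, b in combinations(view, 2))
    return total / len(view)


def jaccard(a, b):
    # Stand-in similarity on token sets (assumption, not the talk's measure).
    return len(a & b) / len(a | b)


view = [{"food", "good"}, {"food", "cold"}, {"room", "small"}]
c = cohesion(view, jaccard)
```

With these sentences only the first pair overlaps (similarity 1/3), so the cohesion is (1/3)/3 = 1/9.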
![Page 17: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/17.jpg)
• Most relevant view (MRV)
- preference to views discussing similar content [greater cohesion]
- top-ranked view
• Outlier view (OV)
- single sentence
- low semantic similarity with other sentences
- Cohesion = 0 [ordered by importance]
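The ranking described above might look like this in code: multi-sentence views are ordered by cohesion and the first one is the MRV, while single-sentence views are treated as outlier views and ordered by importance. The dictionary layout (`sentences`, `cohesion`, `importance`) and the sample values are hypothetical.

```python
def rank_views(views):
    """views: list of dicts with 'sentences', 'cohesion' and 'importance' keys.
    Returns (ranked multi-sentence views, most relevant view, outlier views)."""
    multi = [v for v in views if len(v["sentences"]) > 1]
    single = [v for v in views if len(v["sentences"]) == 1]
    # Greater cohesion first; the top-ranked view is the MRV.
    ranked = sorted(multi, key=lambda v: v["cohesion"], reverse=True)
    # Outlier views all have cohesion 0, so they are ordered by importance.
    outliers = sorted(single, key=lambda v: v["importance"], reverse=True)
    mrv = ranked[0] if ranked else None
    return ranked, mrv, outliers


views = [
    {"sentences": ["s1", "s2"], "cohesion": 0.30, "importance": 0.2},
    {"sentences": ["s3"], "cohesion": 0.0, "importance": 0.1},
    {"sentences": ["s4", "s5", "s6"], "cohesion": 0.55, "importance": 0.4},
    {"sentences": ["s7"], "cohesion": 0.0, "importance": 0.7},
]
ranked, mrv, outliers = rank_views(views)
```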
![Page 18: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/18.jpg)
[Figure: pipeline. Related Articles (HTML + Text) → Data Cleaning and Preprocessing (Raw Text) → Extract Top-ranked Sentences (Top-n Sentences) → Clustering Engine (Views) → Ranking by Cohesion (Ranked Views) → Ranked Views, MRV & OV]
![Page 19: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/19.jpg)
• Number of top-ranking sentences (n) vs. cohesion
- n which can maximize cohesion
- median cohesion >= mean [outliers]
- 20 <= n <= 35
- incremental clustering¹
• More top-ranking sentences need not necessarily lead to views with better cohesion
1. M. Charikar STOC 1997
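One way to pick n is to compare the cohesion of the views obtained for each candidate value. The sketch below assumes the median is used as the summary, since outlier views with cohesion 0 pull the mean down. The function name `best_n` and the cohesion values are made up for illustration and are not results from the talk.

```python
from statistics import mean, median


def best_n(cohesions_by_n):
    """cohesions_by_n: {n: [cohesion of each view built from the top-n sentences]}.
    Returns the n whose views have the highest median cohesion."""
    return max(cohesions_by_n, key=lambda n: median(cohesions_by_n[n]))


# Made-up cohesion values; each list ends with an outlier view (cohesion 0).
cohesions_by_n = {
    20: [0.4, 0.5, 0.0],
    25: [0.6, 0.7, 0.0],
    35: [0.3, 0.5, 0.0],
}
chosen = best_n(cohesions_by_n)
```

In this toy data the largest n does not give the best views, and for every n the median is at least the mean because of the zero-cohesion outlier.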
![Page 20: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/20.jpg)
![Page 21: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/21.jpg)
• IR model as an alternative to summarization
- multiple diverse views
- easily navigable
- browse top x views
- detailed (yet organized) snapshot of a ToI
- clustering at sentence/phrase level*
• Future work
- polarity of a view
- user feedback
  - Implicit [clicks, time-spent]
  - Explicit [user-ratings]
* as opposed to document clustering
![Page 22: Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochampally](https://reader034.vdocuments.net/reader034/viewer/2022042514/558a668dd8b42a544a8b46c3/html5/thumbnails/22.jpg)
Thanks!