Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections. Xuan-Hieu Phan (GSIS, Tohoku University), Le-Minh Nguyen (GSIS, JAIST), Susumu Horiguchi (GSIS, Tohoku University). WWW 2008. NLG Seminar, 2008/12/31. Reporter: Kai-Jie Ko.
Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections
Xuan-Hieu Phan (GSIS, Tohoku University), Le-Minh Nguyen (GSIS, JAIST), Susumu Horiguchi (GSIS, Tohoku University)
WWW 2008
NLG Seminar, 2008/12/31. Reporter: Kai-Jie Ko
Motivation
Many classification tasks that work with short segments of text and Web data, such as search snippets, forum and chat messages, blog and news feeds, product reviews, and book and movie summaries, fail to achieve high accuracy because of data sparseness.
Previous work to overcome data sparseness
Employ search engines to expand and enrich the context of the data.
Drawback: time consuming!
Previous work to overcome data sparseness
Use online data repositories, such as Wikipedia or the Open Directory Project, as external knowledge sources.
Drawback: these approaches only used the user-defined categories and concepts in those repositories, which is not general enough.
General framework
(a) Choose a universal dataset
• It must be large and rich enough to cover the words and concepts related to the classification problem.
• Wikipedia and MEDLINE are chosen in this paper.
(a) Choose a universal dataset
Use topic-oriented keywords to crawl Wikipedia with a maximum hyperlink depth of 4:
◦ 240MB
◦ 71,968 documents
◦ 882,376 paragraphs
◦ 60,649-word vocabulary
◦ 30,492,305 words
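A depth-limited crawl like the one described on this slide can be sketched as a breadth-first traversal. This is only an illustration: the function name `crawl`, the `get_links` callback, and the toy link graph are assumptions made for the example, not the authors' crawler, and no real Wikipedia pages are fetched.

```python
from collections import deque


def crawl(seeds, get_links, max_depth=4):
    """Breadth-first crawl from seed pages, following links at most max_depth hops."""
    seen = set(seeds)
    queue = deque((page, 0) for page in seeds)
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        if depth == max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for nxt in get_links(page):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


# Toy link graph standing in for Wikipedia (hypothetical page names).
TOY_LINKS = {
    "Health": ["Medicine", "Nutrition"],
    "Medicine": ["Surgery"],
    "Surgery": ["Anesthesia"],
    "Anesthesia": ["Pharmacology"],
    "Pharmacology": ["Chemistry"],
}

pages = crawl(["Health"], lambda p: TOY_LINKS.get(p, []), max_depth=4)
print(pages)
```

In this toy graph "Chemistry" sits five hops from the seed keyword page, so it is never collected.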
(a) Choose a universal dataset
Ohsumed: a test collection of medical journal abstracts built to assist IR research
◦ 156MB
◦ 233,442 abstracts
(b) Doing topic analysis for the universal dataset
Topic analysis uses GibbsLDA++, a C/C++ implementation of LDA using Gibbs sampling.
The number of topics is varied from 10, 20, ... up to 100, and then 150 and 200.
The hyperparameters alpha and beta are set to 0.5 and 0.1, respectively.
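To make the estimation step concrete, here is a minimal pure-Python sketch of collapsed Gibbs sampling for LDA, using the alpha = 0.5 and beta = 0.1 values from the slide as defaults. It is not GibbsLDA++ and does not reproduce its interface; the function name `gibbs_lda`, the integer word-id encoding, and the toy corpus are assumptions for illustration.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.5, beta=0.1, iters=200, seed=0):
    """Toy collapsed Gibbs sampler. docs is a list of lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # n(k, w)
    topic_total = [0] * num_topics                               # n(k)
    assign = []
    # Random initial topic assignment for every word token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n(d,t) + alpha) * (n(t,w) + beta) / (n(t) + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed estimates of topic-word (phi) and document-topic (theta) distributions.
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return phi, theta


# Hypothetical toy corpus: word ids 0-1 and 2-3 tend to occur together.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 0, 1], [3, 2, 2]]
phi, theta = gibbs_lda(toy_docs, num_topics=2, vocab_size=4, iters=50)
print(theta)
```

Each row of `phi` is a distribution over the vocabulary for one topic, and each row of `theta` is a distribution over topics for one document.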
Hidden topics analysis for Wikipedia data
Hidden topics analysis for the Ohsumed-MEDLINE data
(c) Building a moderate-size labeled training dataset
• Words/terms in this dataset should be relevant to as many hidden topics as possible.
(d) Doing topic inference for training and future data
• Transforms the original data into a set of topics.
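One way to picture topic inference for a new short document is to hold the topic-word distributions from the universal dataset fixed and sample topic assignments only for the new words. The sketch below makes that simplifying assumption; the function name `infer_topics` and the hand-written `phi` matrix are hypothetical, and the actual inference procedure used in the paper may update counts differently.

```python
import random


def infer_topics(doc, phi, alpha=0.5, iters=100, seed=0):
    """Estimate the topic distribution of one new document.

    doc is a list of integer word ids; phi[k][w] is a fixed, strictly positive
    probability of word w under topic k (assumed to come from the universal data).
    """
    rng = random.Random(seed)
    num_topics = len(phi)
    counts = [0] * num_topics
    zs = []
    for _ in doc:
        k = rng.randrange(num_topics)
        zs.append(k)
        counts[k] += 1
    for _ in range(iters):
        for i, w in enumerate(doc):
            counts[zs[i]] -= 1
            # The topic-word term stays fixed; only document-topic counts change.
            weights = [(counts[t] + alpha) * phi[t][w] for t in range(num_topics)]
            zs[i] = rng.choices(range(num_topics), weights=weights)[0]
            counts[zs[i]] += 1
    total = len(doc) + num_topics * alpha
    return [(c + alpha) / total for c in counts]


# Hypothetical topic-word matrix: topic 0 favours words 0-1, topic 1 favours words 2-3.
toy_phi = [
    [0.49, 0.49, 0.01, 0.01],
    [0.01, 0.01, 0.49, 0.49],
]
snippet = [0, 1, 0, 1, 0, 1, 0, 1]  # a short document using only words 0 and 1
snippet_theta = infer_topics(snippet, toy_phi)
print(snippet_theta)
```

The returned vector is the "set of topics" representation of the snippet: one probability per hidden topic.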
Sample Google search snippets
Snippet word co-occurrence
This shows the sparseness of Web snippets: only a small fraction of words is shared by two or three different snippets.
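The sparseness point can be checked with a simple overlap measure. The sketch below computes the fraction of distinct words two snippets share (Jaccard overlap); the function name and the two example snippets are made up for illustration and are not taken from the slide's data.

```python
def shared_word_fraction(a, b):
    """Fraction of distinct words that two snippets have in common (Jaccard overlap)."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


# Hypothetical snippets about the same domain that share almost no vocabulary.
s1 = "online poker tilt texas holdem tournaments"
s2 = "casino games blackjack roulette poker bonus"
print(shared_word_fraction(s1, s2))
```

Two snippets can belong to the same domain and still overlap in only one word, which is why word-level features alone are weak for short text.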
Shared topics among snippets after inference
After inference and integration, the snippets are more closely related semantically.
(e) Building the classifier
• Choose from different learning methods.
• Integrate hidden topics into the training, test, or future data according to the data representation of the chosen learning technique.
• Train the classifier on the integrated training data.
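For a learner that takes bag-of-words input, the integration step above can be sketched by appending topic pseudo-words to a snippet's original words. The cut-off, the scaling factor, and the `topic:k` naming below are assumptions chosen for illustration, not values confirmed by the slides.

```python
def integrate_topics(words, theta, cutoff=0.05, scale=10):
    """Append topic pseudo-words to the original words of a snippet.

    theta[k] is the inferred probability of topic k for this snippet. Topics
    below cutoff are dropped; the rest are repeated in proportion to their
    probability so that stronger topics carry more weight.
    """
    features = list(words)
    for k, p in enumerate(theta):
        if p >= cutoff:
            repeats = max(1, round(p * scale))
            features.extend([f"topic:{k}"] * repeats)
    return features


# Hypothetical snippet words and inferred topic distribution.
feats = integrate_topics(["online", "poker"], [0.6, 0.02, 0.38])
print(feats)
```

The enriched feature list can then be fed to whichever classifier is chosen, in the same format as ordinary words.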
Evaluation
Domain disambiguation for Web search results
◦ Classify Google search snippets into different domains, such as Business, Computers, Health, etc.
Disease classification for medical abstracts
◦ Classify each MEDLINE medical abstract into one of five disease categories related to neoplasms, the digestive system, etc.
Domain disambiguation for Web search results
Google snippets are collected as training and test data; the search phrases used for the two sets are completely disjoint.
Domain disambiguation for Web search results
Results of 5-fold cross-validation on the training data:
the error is reduced by 19% on average.
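As a reminder of how such figures are obtained, the sketch below builds 5-fold train/test splits and computes a relative error reduction. Reading the 19% as a relative reduction is an assumption, and the error rates in the example are hypothetical numbers, not results from the paper.

```python
def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k-fold cross-validation over n items."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def relative_error_reduction(baseline_error, new_error):
    """Share of the baseline error that the new method removes."""
    return (baseline_error - new_error) / baseline_error


splits = list(k_fold_indices(10, k=5))

# Hypothetical error rates: a baseline at 20% and an improved model at 16%.
reduction = relative_error_reduction(0.20, 0.16)
print(f"{reduction:.0%}")
```

Each item appears in exactly one test fold, so averaging the per-fold errors uses every training example once for evaluation.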
Disease Classification for Medical Abstracts with MEDLINE Topics
The proposed method requires only 4,500 training examples to reach the accuracy of the baseline, which uses 22,500 training examples (five times as many)!
Conclusion
Advantages of the proposed framework:
◦ A good method for classifying sparse and previously unseen data: it utilizes the large universal dataset.
◦ Expands the coverage of the classifier: topics coming from external data cover many terms/words that do not exist in the training dataset.
◦ Easy to implement: only a small set of labeled training examples has to be prepared to attain high accuracy.