Link Distribution on Wikipedia
[0407]KwangHee Park
Table of contents
- Introduction
- Topic modeling
- Preliminary problem
- Conclusion
Introduction: Why focus on links?
When someone creates a new article on Wikipedia, they often simply link to the corresponding article in another language, or to similar and related articles. The article is then gradually written by others.
Assumption: the linked terms in a Wikipedia article are key terms that represent the specific characteristics of that article.
Introduction: The problem we want to solve
To analyze the latent topic distribution of a set of target documents by topic modeling.
Topic modeling: Topic
Topics are latent concepts buried in the textual artifacts of a community, described by a collection of many terms that co-occur frequently in context.
(Laura Dietz and Avaré Stewart, 2006, "Utilize Probabilistic Topic Models to Enrich Knowledge Bases")
T = {W_1, …, W_n}
Topic modeling: Bag-of-words assumption
"The bag-of-words model is a simplifying assumption used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order." (from Wikipedia)
Each document in the corpus is represented by a vector of integers {f_1, f_2, …, f_|W|}, where f_i is the frequency of the i-th word and |W| is the number of words in the vocabulary.
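As a small sketch of this representation (the vocabulary and document below are made-up examples), the frequency vector can be built as:

```python
from collections import Counter

def bag_of_words(doc_tokens, vocabulary):
    """Represent a document as a vector of word frequencies over a fixed vocabulary."""
    counts = Counter(doc_tokens)
    return [counts[w] for w in vocabulary]

vocab = ["wikipedia", "link", "article", "topic"]
doc = ["link", "article", "link", "topic"]
print(bag_of_words(doc, vocab))  # [0, 2, 1, 1]
```

Word order is discarded; only the counts per vocabulary entry survive.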
Topic modeling
Instead of directly associating documents with words, associate each document with some topics, and each topic with some significant words.
Document = {T_n, T_k, …, T_m}, e.g. {Doc: 1} → {T_n: 0.4, T_k: 0.3, …}
Topic modeling
Based on the idea that documents are mixtures of topics.
Modeling: Document → topic → term
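The document → topic → term chain above can be sketched as a toy generative process (the topic mixtures and word distributions below are hypothetical numbers, not learned values):

```python
import random

# Hypothetical toy distributions, for illustration only.
doc_topics = {"T1": 0.4, "T2": 0.6}  # topic mixture of one document
topic_words = {
    "T1": {"cancer": 0.7, "gene": 0.3},
    "T2": {"wikipedia": 0.5, "link": 0.5},
}

def sample_word(doc_topics, topic_words, rng):
    """Generate one word: pick a topic from the document's mixture, then a word from that topic."""
    topic = rng.choices(list(doc_topics), weights=list(doc_topics.values()))[0]
    words = topic_words[topic]
    return rng.choices(list(words), weights=list(words.values()))[0]

rng = random.Random(0)
print([sample_word(doc_topics, topic_words, rng) for _ in range(5)])
```

Every generated word comes from some topic, and topics are chosen according to the document's mixture weights.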
Topic modeling: LSA
LSA performs dimensionality reduction using the singular value decomposition. The transformed word–document co-occurrence matrix X is factorized into three smaller matrices U, D, and V:
- U provides an orthonormal basis for a spatial representation of words
- D weights those dimensions
- V provides an orthonormal basis for a spatial representation of documents
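A minimal numpy sketch of this factorization, using a made-up toy co-occurrence matrix:

```python
import numpy as np

# Toy word-document co-occurrence matrix (rows = words, columns = documents);
# the counts are invented for illustration.
X = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 3.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 2.0],
])

# SVD: X = U @ diag(D) @ Vt. LSA keeps only the k largest singular values.
U, D, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, D_k, Vt_k = U[:, :k], D[:k], Vt[:k, :]

# Rank-k reconstruction: the reduced-dimensional approximation of X.
X_approx = U_k @ np.diag(D_k) @ Vt_k
print(np.round(X_approx, 2))
```

Rows of U_k give word vectors and columns of Vt_k give document vectors in the reduced space.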
Topic modeling: pLSA
- Observed word distributions: P(w_i | d_j)
- Word distributions per topic: P(w_i | z_k)
- Topic distributions per document: P(z_k | d_j)

P(w_i | d_j) = Σ_{k=1}^{K} P(w_i | z_k) P(z_k | d_j)
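The mixture above is just a matrix product of the two factor matrices. A sketch with hypothetical (made-up) pLSA parameters for K = 2 topics, V = 3 words, and D = 2 documents:

```python
import numpy as np

# Hypothetical factor matrices; each column is a probability distribution.
P_w_given_z = np.array([   # shape (V, K): P(w_i | z_k)
    [0.5, 0.1],
    [0.3, 0.2],
    [0.2, 0.7],
])
P_z_given_d = np.array([   # shape (K, D): P(z_k | d_j)
    [0.8, 0.3],
    [0.2, 0.7],
])

# P(w_i | d_j) = sum_k P(w_i | z_k) P(z_k | d_j)
P_w_given_d = P_w_given_z @ P_z_given_d
print(P_w_given_d.sum(axis=0))  # each column sums to 1: a word distribution per document
```

Because both factors have columns that sum to one, each column of the product is itself a valid word distribution.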
Topic modeling: LDA (Latent Dirichlet Allocation)
The number of parameters to be estimated in pLSA grows with the size of the training set; in this respect, LDA has an advantage.
Alpha and beta are corpus-level parameters, sampled once in the corpus-generating process (outside of the plates!).
pLSA vs. LDA
Topic modeling – our approach
- Target: documents = Wikipedia articles; terms = linked terms in each document
- Modeling method: LDA
- Modeling tool: LingPipe API
Advantages of linked terms
- No extra preprocessing is needed: boundary detection, stopword removal, and word stemming are already taken care of
- They carry more semantics: there is a correlation between a term and a document, e.g. "cancer" as a term vs. "cancer" as an article
Preliminary problem
How well do the link terms in a document represent the specific characteristics of that document?
Link evaluation: calculate the similarity between documents.
Link evaluation: Similarity-based evaluation
- Calculate the similarity between terms: Sim_t(term1, term2)
- Calculate the similarity between documents: Sim_d(doc1, doc2)
- Compare the two similarities
Link evaluation
- Sim_t: similarity between terms; not affected by the input term set
- Sim_d: similarity between documents; significantly affected by the input term set
p, q = the topic distributions of the two documents (Lin, 1991)
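The slide's formula for Sim_d was lost in extraction; the cited Lin (1991) paper introduced the Jensen–Shannon divergence, so assuming that is the intended measure over the topic distributions p and q, a sketch:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions (Lin, 1991)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic distributions of two documents (made-up numbers).
p = [0.4, 0.3, 0.2, 0.1]
q = [0.1, 0.2, 0.3, 0.4]
print(js_divergence(p, q))  # 0 means identical distributions; larger means less similar
```

With base-2 logarithms the divergence is bounded in [0, 1], which makes it convenient to turn into a similarity score.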
Link evaluation
Compare the top 10 most similar items for each link. For example, for link A:
- the list of terms most similar to A as a term
- the list of documents most similar to A as a document
Compare the two lists by counting the number of overlaps.
Experiments are now under way.
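The overlap count described above is a simple set intersection of the two top-10 lists; a sketch with hypothetical rankings (the item names are made up):

```python
def top10_overlap(term_ranking, doc_ranking):
    """Count how many items appear in both top-10 lists."""
    return len(set(term_ranking[:10]) & set(doc_ranking[:10]))

# Hypothetical neighbor lists for link A.
as_term = ["B", "C", "D", "E", "F"]  # most similar to A as a term
as_doc = ["C", "E", "G", "H", "B"]   # most similar to A as a document
print(top10_overlap(as_term, as_doc))  # 3 items overlap: B, C, E
```

A higher overlap suggests that A behaves consistently whether it is viewed as a term or as a document.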
Conclusion
- Topic modeling with the link distribution in Wikipedia
- Need to measure how well the link distribution can represent each article's characteristics
- After that, analyze the topic distribution in a variety of ways
- Expect that the topic distribution can be applied in many applications
Thank you.