Link Distribution on Wikipedia [0407]KwangHee Park

Uploaded by alban-white on 08-Jan-2018


TRANSCRIPT

Page 1: Link Distribution on Wikipedia [0407]KwangHee Park

Link Distribution on Wikipedia

[0407]KwangHee Park

Page 2:

Table of contents
- Introduction
- Topic modeling
- Preliminary problem
- Conclusion

Page 3:

Introduction: Why we focus on links

When someone creates a new article on Wikipedia, they mostly just link to the corresponding source in another language or to similar and related articles; the article is then written out by others.

Assumption: the linked terms in a Wikipedia article are key terms that represent the specific characteristics of the article.

Page 4:

Introduction: The problem we want to solve

To analyze the latent topic distribution of a set of target documents by topic modeling.

Page 5:

Topic modeling: Topics

"Topics are latent concepts buried in the textual artifacts of a community, described by a collection of many terms that co-occur frequently in context."

Laura Dietz and Avaré Stewart, 2006, "Utilize Probabilistic Topic Models to Enrich Knowledge Bases"

T = {W_1, …, W_n}

Page 6:

Topic modeling: Bag-of-words assumption

"The bag-of-words model is a simplifying assumption used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order." (From Wikipedia)

Each document in the corpus is represented by a vector of integers {f_1, f_2, …, f_|W|}, where f_i is the frequency of the i-th word and |W| is the number of words in the vocabulary.
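Building such a frequency vector takes only a few lines; a minimal Python sketch (the tokenizer, vocabulary, and example sentence here are illustrative, not from the slides):

```python
from collections import Counter

def bag_of_words(doc, vocabulary):
    """Represent a document as a vector of integer word frequencies
    over a fixed vocabulary, ignoring grammar and word order."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["wikipedia", "link", "topic", "article"]
doc = "Link terms in a Wikipedia article link to another article"
print(bag_of_words(doc, vocab))  # [1, 2, 0, 2]
```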

Page 7:

Topic modeling: Instead of directly associating documents with words, associate each document with some topics and each topic with some significant words.

Document = {T_n, T_k, …, T_m}, e.g. {Doc: 1} becomes {T_n: 0.4, T_k: 0.3, …}

Page 8:

Topic modeling: Based on the idea that documents are mixtures of topics.

Modeling: document → topic → term

Page 9:

Topic modeling: LSA

LSA performs dimensionality reduction using the singular value decomposition. The transformed word-document co-occurrence matrix X is factorized into three smaller matrices U, D, and V:
- U provides an orthonormal basis for a spatial representation of words
- D weights those dimensions
- V provides an orthonormal basis for a spatial representation of documents
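The factorization X = U D Vᵀ can be sketched with NumPy's SVD; truncating to the top k singular values gives the reduced LSA space (the toy co-occurrence matrix is made up for illustration):

```python
import numpy as np

# Toy word-document co-occurrence matrix X (rows = words, cols = documents)
X = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])

# Full SVD: U spans word space, d holds the weights, Vt spans document space
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top k singular values for dimensionality reduction;
# the result is the best rank-k approximation of X (Eckart-Young theorem)
k = 2
X_k = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]

print(np.round(X_k, 2))
```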

Page 10:

Topic modeling: pLSA

Observed word distributions = word distributions per topic × topic distributions per document:

p(w_i | d_j) = Σ_{k=1}^{K} p(w_i | z_k) p(z_k | d_j)
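The pLSA mixture over topics is just a matrix product of the two factor matrices; a small NumPy check with made-up distributions (the values are illustrative):

```python
import numpy as np

# p(w|z): word distributions per topic, shape (|W|, K); columns sum to 1
p_w_given_z = np.array([[0.5, 0.1],
                        [0.3, 0.2],
                        [0.2, 0.7]])

# p(z|d): topic distributions per document, shape (K, |D|); columns sum to 1
p_z_given_d = np.array([[0.8, 0.3],
                        [0.2, 0.7]])

# p(w|d) = sum_k p(w|z_k) p(z_k|d): observed word distribution per document
p_w_given_d = p_w_given_z @ p_z_given_d

print(np.round(p_w_given_d, 3))
```

Because both factors are column-stochastic, every column of p(w|d) again sums to 1, i.e. each document gets a valid word distribution.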

Page 11:

Topic modeling: LDA (Latent Dirichlet Allocation)

The number of parameters to be estimated in pLSA grows with the size of the training set; in this respect LDA has an advantage.

Alpha and beta are corpus-level parameters that are sampled once in the corpus-generating process (outside of the plates!).

[Plate diagrams: pLSA vs. LDA]

Page 12:

Topic modeling: our approach

- Target: document = Wikipedia article; terms = linked terms in the document
- Modeling method: LDA
- Modeling tool: LingPipe API
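The slides use the LingPipe API (Java); as an illustrative stand-in, the same step can be sketched with scikit-learn's LatentDirichletAllocation (the count matrix and parameters below are made up, not from the experiments):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix (rows = articles, cols = terms);
# in the slides' setting each column would be a linked Wikipedia term.
X = np.array([[4, 3, 0, 0],
              [3, 4, 1, 0],
              [0, 0, 4, 3],
              [0, 1, 3, 4]])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)  # per-document topic distributions

print(np.round(doc_topic, 2))     # each row sums to 1
```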

Page 13:

Advantages of linked terms

No extra preprocessing is needed:
- boundary detection
- stopword removal
- word stemming

Linked terms include more semantics: there is a correlation between a term and its document, e.g. "cancer" as a term vs. Cancer as a document.

[Diagram: the term "cancer" linking to the article Cancer]

Page 14:

Preliminary problem

How well do the linked terms in a document represent the specific characteristics of that document?

Link evaluation: calculate the similarity between documents.

Page 15:

Link evaluation: Similarity-based evaluation

- Calculate the similarity between terms: Sim_t(term1, term2)
- Calculate the similarity between documents: Sim_d(doc1, doc2)
- Compare the two similarities

Page 16:

Link evaluation: Sim_t vs. Sim_d

- Sim_t: similarity between terms; not affected by the input term set
- Sim_d: similarity between documents; significantly affected by the input term set

p, q = topic distributions of the two documents (Lin 1991)
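The slide cites Lin 1991, which introduced the Jensen-Shannon divergence between two distributions; assuming that is the intended measure, Sim_d over topic distributions p and q can be sketched as:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (Lin 1991) between two discrete
    distributions; with log base 2 the value lies in [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # 0 * log(0/x) = 0 by convention
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Topic distributions of two documents
p = [0.4, 0.3, 0.2, 0.1]
q = [0.1, 0.2, 0.3, 0.4]
print(1.0 - js_divergence(p, q))  # similarity: 1.0 for identical distributions
```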

Page 17:

Link evaluation: Compare the top 10 most similar items for each link

Example: for a link A,
- the list of terms most similar to A as a term
- the list of documents most similar to A as a document
Compare the two lists by counting the number of overlaps.

Experiments are currently underway.
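The overlap count described above reduces to a set intersection of the two top-10 lists; a minimal sketch (the link "cancer" and its neighbor lists are hypothetical):

```python
def top_k_overlap(term_neighbors, doc_neighbors, k=10):
    """Count how many items appear in both top-k similarity lists:
    the links most similar to A as a term and as a document."""
    return len(set(term_neighbors[:k]) & set(doc_neighbors[:k]))

# Hypothetical top-10 neighbor lists for the link "cancer"
as_term = ["tumor", "oncology", "cell", "dna", "chemotherapy",
           "leukemia", "gene", "mutation", "radiation", "biopsy"]
as_doc  = ["oncology", "tumor", "surgery", "gene", "pathology",
           "mutation", "dna", "virus", "cell", "therapy"]
print(top_k_overlap(as_term, as_doc))  # 6
```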

Page 18:

Conclusion

- Topic modeling with the link distribution in Wikipedia
- We need to measure how well the link distribution can represent each article's characteristics
- After that, analyze the topic distribution in a variety of ways
- We expect the topic distribution to be applicable in many applications

Page 19:

Thank you