topic models and latent dirichlet allocation.docx


  • 7/29/2019 TOPIC MODELS AND LATENT DIRICHLET ALLOCATION.docx


In addition, the LDA model was designed so that the choice of topics does not depend on a fixed set of previously seen documents; this means the model can handle documents that are not known in advance. This property comes from technical choices that are not easy to grasp at first, in particular the use of the Dirichlet distribution.

The Dirichlet distribution serves as a prior over multinomial distributions: instead of assigning a topic directly, as in pLSI, we first draw a distribution over topics for each document, and then draw each word's topic from that distribution.

Fig. 2. Plate Notation of the LDA Algorithm.

LDA assumes the following generative process for each document i in a corpus D (Blei et al., 2003):

1. Choose a topic distribution θi ~ Dirichlet(α), where i ∈ {1, …, M} and Dirichlet(α) denotes the Dirichlet distribution with parameter α.

2. For each of the words wij in the document, where j ∈ {1, …, Ni}:

(a) Choose a topic zij ~ Multinomial(θi), where Multinomial(θi) denotes a multinomial distribution with parameter θi.

(b) Choose a word wij ~ φzij.

Here, w represents the words, z represents a vector of topics, and φ is a k × V word-probability matrix with one row per topic and one column per term,

where φij = p(wj = 1 | zi = 1)
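Written out, the generative steps above correspond to the joint distribution over a document's topic mixture θ, topic assignments z, and words w, following Blei et al. (2003), with the word-probability matrix written here as φ:

```latex
p(\theta, z, w \mid \alpha, \varphi)
  = p(\theta \mid \alpha) \prod_{j=1}^{N} p(z_j \mid \theta)\, p(w_j \mid z_j, \varphi)
```

Each factor matches one step: p(θ | α) is the Dirichlet draw, p(zj | θ) the per-word topic choice, and p(wj | zj, φ) the word drawn from the selected topic's row of φ.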

Plate notation is shown in Fig. 2.
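As an illustration, the generative process can be sketched in plain Java. The topic count, document length, and the topic-word matrix phi below are made-up values for the example, and a symmetric Dirichlet with α = 1 is assumed so that the Gamma draws reduce to exponential draws; this is a sketch of the sampling steps, not a real inference implementation.

```java
import java.util.Random;

public class LdaGenerativeSketch {
    static final Random RNG = new Random(42);

    // Sample from a symmetric Dirichlet(1, ..., 1): Gamma(1, 1) draws are
    // exponential, so theta_k = -ln(1 - U_k), normalized to sum to 1.
    static double[] sampleDirichletUniform(int k) {
        double[] theta = new double[k];
        double sum = 0.0;
        for (int i = 0; i < k; i++) {
            theta[i] = -Math.log(1.0 - RNG.nextDouble());
            sum += theta[i];
        }
        for (int i = 0; i < k; i++) theta[i] /= sum;
        return theta;
    }

    // Draw one index from a discrete distribution (a multinomial with n = 1).
    static int sampleDiscrete(double[] probs) {
        double u = RNG.nextDouble();
        double cum = 0.0;
        for (int i = 0; i < probs.length; i++) {
            cum += probs[i];
            if (u < cum) return i;
        }
        return probs.length - 1;
    }

    public static void main(String[] args) {
        int K = 2, N = 10;   // hypothetical: 2 topics, 10 words per document
        // Hypothetical k x V topic-word matrix phi (each row sums to 1, V = 4).
        double[][] phi = {
            {0.4, 0.4, 0.1, 0.1},   // topic 0 favours terms 0 and 1
            {0.1, 0.1, 0.4, 0.4}    // topic 1 favours terms 2 and 3
        };
        double[] theta = sampleDirichletUniform(K);   // step 1: topic mixture
        for (int j = 0; j < N; j++) {
            int z = sampleDiscrete(theta);            // step 2(a): topic for word j
            int w = sampleDiscrete(phi[z]);           // step 2(b): term from topic z's row
            System.out.println("word " + j + ": topic=" + z + " term=" + w);
        }
    }
}
```

Running the sketch for two documents would typically give different theta vectors, and hence different topic proportions, which is exactly the behaviour Step 1 is meant to capture.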

Step 1 reflects that each document contains topics in different proportions. For instance, one document may contain many words drawn from a topic on climate and no words drawn from a topic about diseases, while a different document may have an equal number of words drawn from both topics. Step 2(a) reflects that each individual word in the document is drawn from one of the K topics in proportion to the document's distribution over topics as determined in Step 1. In Step 2(b), the selection of each word depends on the distribution over the V vocabulary words as determined by the selected topic. The


generative model does not make any assumptions about the order of the words in the documents. This is known as the bag-of-words assumption. The main objective of topic modeling is to discover topics from a document collection automatically. In Fig. 3, the parameters α and φ constitute the outermost level of the model, the parameter θd forms the middle level, and the variables Zdn and Wdn are at the innermost level of the model. The parameters α and φ are corpus-level parameters, assumed to be sampled once in the process of generating the corpus. The variables θd are document-level variables, sampled once per document, and the variables Zdn and Wdn are word-level variables. We used the LDA implementation provided by the MALLET package in Java.
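The bag-of-words assumption above can be made concrete with a small sketch: only word counts matter, so two documents containing the same words in different orders are indistinguishable to the model. The example strings are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

public class BagOfWords {
    // Under the bag-of-words assumption only word counts matter:
    // two documents with identical counts look the same to LDA.
    static Map<String, Integer> toCounts(String doc) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : doc.toLowerCase().split("\\s+")) {
            counts.merge(token, 1, Integer::sum);   // increment the count for this word
        }
        return counts;
    }

    public static void main(String[] args) {
        String a = "climate warms the climate";
        String b = "the climate climate warms";    // same words, different order
        System.out.println(toCounts(a).equals(toCounts(b)));   // order is ignored
    }
}
```

In practice MALLET performs this conversion internally when it turns raw text into feature sequences before estimating the model.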

Fig. 3. Graphical Notation Representing the LDA Model.

    SYSTEM ARCHITECTURE: