
Latent Dirichlet Allocation

David M Blei, Andrew Y Ng & Michael I Jordan

presented by Tilaye Alemu & Anand Ramkissoon

Motivation for LDA

In lay terms:
- document modelling
- text classification
- collaborative filtering
- ...
...in the context of Information Retrieval

The principal focus in this paper is on document classification within a corpus

Structure of this talk

Part 1: Theory
- Background
- (some) other approaches

Part 2:
- Experimental results
- some details of usage
- wider applications

LDA: conceptual features

A generative probabilistic model for collections of discrete data (such as text corpora)

- 3-level hierarchical Bayesian model
- mixture models
- efficient approximate inference techniques: variational methods
- EM algorithm for empirical Bayes parameter estimation

How to classify text documents

Word (term) frequency: tf-idf
- produces a term-by-document matrix
- identifies discriminative sets of words
- reduces documents to fixed-length lists of numbers
- but reveals little statistical structure

Dimensionality reduction: Latent Semantic Indexing (LSI)
- singular value decomposition of the term-by-document matrix
- but not generative
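As a concrete illustration of the term-by-document representation, here is a minimal tf-idf sketch, assuming scikit-learn is available; the toy corpus and variable names are illustrative only:

# Build a tf-idf term-by-document matrix for a toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "topic models uncover latent structure in text",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (M documents, V terms)

# Each document is now a fixed-length list of numbers:
# one tf-idf weight per vocabulary term.
print(X.shape)
print(vectorizer.get_feature_names_out())

Each row of X is the fixed-length tf-idf vector for one document; it is this representation's lack of statistical structure that motivates LSI and, later, LDA.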

How to classify text documents (cont'd)

probabilistic LSI (pLSI)
- each word is generated by one topic
- each document is generated by a mixture of topics
- a document is represented as a list of mixing proportions for topics

Problems with pLSI:
- no generative model for these mixing proportions
- the number of parameters grows linearly with the size of the corpus (one topic-mixture vector per training document)
- prone to overfitting
- no clear way to assign probability to documents outside the training set

A major simplifying assumption

- A document is a “bag of words”
- A corpus is a “bag of documents”

- word order is unimportant: exchangeability
- de Finetti representation theorem: any collection of exchangeable random variables has a representation as a (generally infinite) mixture distribution
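Concretely, in its generic form the theorem gives an exchangeable word sequence the mixture representation below, where θ is the latent parameter that LDA will identify with the topic mixture:

p(w_1, \ldots, w_N) = \int p(\theta) \left( \prod_{n=1}^{N} p(w_n \mid \theta) \right) d\theta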

A note about exchangeability

- Exchangeability does not mean that the random variables are iid
- They are iid when conditioned on an underlying latent parameter of a probability distribution

- Conditionally, the joint distribution is simple and factored

Notation

- word: the basic unit of discrete data, an item from a vocabulary indexed by {1, ..., V}; each word is represented as a unit-basis V-vector
- document: a sequence of N words, w = (w_1, ..., w_N)
- corpus: a collection of M documents, D = {w_1, ..., w_M}

Each document is considered a random mixture over latent topics

Each topic is considered a distribution over words

LDA assumes a generative process for each document in the corpus:
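For each document w in a corpus D, the paper's process is:

1. Choose N ~ Poisson(ξ)
2. Choose θ ~ Dir(α)
3. For each of the N words w_n:
   (a) choose a topic z_n ~ Multinomial(θ)
   (b) choose a word w_n from the multinomial probability p(w_n | z_n, β) conditioned on the topic z_n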

Probability density for the Dirichlet random variable
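For a k-dimensional Dirichlet random variable θ with parameter α, the density (in the paper's notation) is:

p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}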

Joint distribution of a Topic mixture
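Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is:

p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)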

Marginal distribution of a document
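Integrating over θ and summing over z gives the marginal distribution of a single document:

p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta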

Probability of a corpus
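The probability of the corpus is the product of its documents' marginals:

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d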

Marginalize over z

The word distribution
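Summing the topic assignment z out of the word-level model gives the distribution of a word given the topic mixture:

p(w \mid \theta, \beta) = \sum_{z} p(w \mid z, \beta) \, p(z \mid \theta)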

The generative process

a Unigram Model
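In a simple unigram model, the words of every document are drawn independently from a single multinomial distribution:

p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)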

probabilistic Latent Semantic Indexing
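pLSI conditions the topic on a document index d, making the word w_n and the document d conditionally independent given the topic z:

p(d, w_n) = p(d) \sum_{z} p(w_n \mid z) \, p(z \mid d)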

Inference from LDA
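The key inferential problem is computing the posterior of the hidden variables given a document:

p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}

This posterior is intractable to compute exactly: the normalizing constant couples θ and β, which motivates the variational approximation below.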

Variational Inference

A family of distributions on latent variables

The Dirichlet parameter γ and the multinomial parameters φ are the free variational parameters
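In the paper's notation, this family factorizes over the latent variables:

q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)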

The update equations
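Coordinate ascent yields the paper's update equations, where Ψ denotes the digamma function:

\phi_{ni} \propto \beta_{i w_n} \exp\left( \Psi(\gamma_i) - \Psi\left( \sum_{j=1}^{k} \gamma_j \right) \right)

\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}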

These updates minimize the Kullback-Leibler divergence between the variational distribution and the true posterior

Variational Inference Algorithm
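A minimal sketch of this per-document coordinate-ascent loop, assuming NumPy/SciPy and a fixed, known topic-word matrix beta; the function name and initializations are illustrative, not from the paper:

# Variational inference for one document: alternate the phi and gamma
# updates until gamma converges.
import numpy as np
from scipy.special import digamma

def variational_inference(word_ids, alpha, beta, n_iter=100, tol=1e-6):
    """Coordinate ascent on (gamma, phi) for a single document.

    word_ids: indices of the document's N words into the vocabulary {0..V-1}
    alpha:    (k,) Dirichlet hyperparameter
    beta:     (k, V) topic-word probabilities (assumed fixed and known here)
    """
    k = len(alpha)
    N = len(word_ids)
    phi = np.full((N, k), 1.0 / k)   # initialize phi_ni = 1/k
    gamma = alpha + N / k            # initialize gamma_i = alpha_i + N/k
    for _ in range(n_iter):
        old_gamma = gamma.copy()
        # phi_ni is proportional to beta_{i,w_n} * exp(digamma(gamma_i));
        # the digamma(sum_j gamma_j) term is constant in i and cancels
        # when phi is normalized, so it is omitted here.
        log_phi = np.log(beta[:, word_ids].T + 1e-12) + digamma(gamma)
        log_phi -= log_phi.max(axis=1, keepdims=True)  # numerical stability
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_ni
        gamma = alpha + phi.sum(axis=0)
        if np.abs(gamma - old_gamma).sum() < tol:
            break
    return gamma, phi

# Illustrative usage: 3 topics over a 20-word vocabulary.
rng = np.random.default_rng(0)
alpha = np.full(3, 0.1)
beta = rng.dirichlet(np.ones(20), size=3)
gamma, phi = variational_inference(np.array([0, 3, 3, 7]), alpha, beta)

The returned gamma is the document's approximate posterior Dirichlet over topics, and each row of phi is one word's approximate posterior over topic assignments.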
