Recommendation with a Reason: Collaborative Topic Modeling for Recommending Scientific Articles
Chong Wang
Joint work with David M. Blei
Computer Science Department, Princeton University
SAMSI Computational Advertising Workshop
Recommendation is common
Type “laptop” in Amazon...
What would you do?
Read every description yourself.
Ask a good friend to do it.
Ask many “friends” to do it.
What do they say?
Sorted by Avg. Customer Review
So, where are those recommender systems?
and more ...
But I also do research in machine learning...
Today’s focus — finding scientific articles
A search for “machine learning” on Google gives 2,520,000 results.
If I have already read articles A, B, and C, what should I read next?
Finding relevant articles
1 An important task for researchers:
learn about the general ideas,
learn about the state of the art.
2 Some existing approaches:
following article references: easy to miss relevant citations;
using keyword search: difficult to form queries, only good for directed exploration.
3 We wish to improve the user experience in both recommendation and exploration.
CiteULike
Mendeley
Example user library, from CiteULike
1 Probabilistic latent semantic indexing
2 Optimizing search engines using clickthrough data
3 A re-examination of text categorization methods
4 On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
5 A Comparative Study on Feature Selection in Text Categorization
6 ...
What should we recommend to this user?
Demo
http://www.cs.princeton.edu/~chongw/citeulike/
User can browse his/her interests
A user’s interests are summarized using the top topics he/she is interested in.
http://www.cs.princeton.edu/~chongw/citeulike/users/user2832.html
User can read the recommendations
User can also explore recommendations
See how other people read this article.
[Figure: topic proportions "in the article" vs. "what people see".]
Outline
1 Matrix factorization for recommendation
2 Topic modeling
3 Collaborative topic models
4 Empirical Results
Matrix factorization for recommendation
Three important elements:
users: you and me
items: articles
ratings: a user likes/dislikes some of the articles
Popular solution: collaborative filtering (CF)
matrix factorization: one of the most popular algorithms for recommender systems.
The user-item matrix (rows are users 1-3, columns are items 1-3; ✔ = like, ✖ = dislike, ? = unknown):
$$\begin{bmatrix} 1 & 0 & ? \\ 1 & ? & 0 \\ ? & 1 & 0 \end{bmatrix}$$
Matrix factorization
Users and items are represented in a shared but unknown latent space:
user $i$ — $u_i \in \mathbb{R}^K$,
item $j$ — $v_j \in \mathbb{R}^K$.
Each dimension of the latent space is assumed to represent some kind of unknown common interest.
The rating of item $j$ by user $i$ is given by the dot product
$$r_{ij} = u_i^T v_j,$$
where $r_{ij} = 1$ indicates like and $0$ dislike.
Probabilistic formulation of matrix factorization (PMF)
1 For each user $i$, draw the user latent vector $u_i \sim \mathcal{N}(0, \lambda_u^{-1} I_K)$.
2 For each item $j$, draw the item latent vector $v_j \sim \mathcal{N}(0, \lambda_v^{-1} I_K)$.
3 For each user-item pair $(i, j)$, draw the rating $r_{ij} \sim \mathcal{N}(u_i^T v_j, c_{ij}^{-1})$, where $c_{ij}$ is the precision parameter.
The expectation of the rating is
$$\mathbb{E}[r_{ij}] = u_i^T v_j.$$
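A minimal numpy sketch of this generative process (the sizes, hyperparameters, and constant precision below are illustrative choices, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, K = 3, 3, 2            # toy sizes (assumed)
lambda_u, lambda_v, c = 0.1, 0.1, 10.0   # illustrative hyperparameters

# Steps 1-2: draw user and item latent vectors from zero-mean Gaussian priors.
U = rng.normal(0.0, np.sqrt(1.0 / lambda_u), size=(n_users, K))
V = rng.normal(0.0, np.sqrt(1.0 / lambda_v), size=(n_items, K))

# Step 3: draw each rating around u_i^T v_j with precision c_ij (constant here).
R = rng.normal(U @ V.T, np.sqrt(1.0 / c))
print(R)  # a noisy version of U @ V.T, since E[r_ij] = u_i^T v_j
```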
Learning and Prediction
Learning the latent vectors for users and items:
$$\min_{U,V} \sum_{i,j} (r_{ij} - u_i^T v_j)^2 + \lambda_u \|u_i\|^2 + \lambda_v \|v_j\|^2,$$
where $\lambda_u$ and $\lambda_v$ are regularization parameters.
Prediction for user $i$ on item $j$ (not rated by user $i$ before):
$$r_{ij} \approx u_i^T v_j.$$
How do we understand these latent vectors for users and items?
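One standard way to optimize this objective is alternating least squares: fix V and solve a ridge regression for each u_i, then fix U and solve for each v_j. A sketch, assuming for simplicity that every entry of R is observed:

```python
import numpy as np

def als(R, K=2, lam_u=0.1, lam_v=0.1, n_iters=50, seed=0):
    """Alternating least squares for min sum (r_ij - u_i^T v_j)^2 + reg."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(size=(n_users, K))
    V = rng.normal(size=(n_items, K))
    I = np.eye(K)
    for _ in range(n_iters):
        # Fix V, solve the ridge problem for every user vector u_i at once.
        U = np.linalg.solve(V.T @ V + lam_u * I, V.T @ R.T).T
        # Fix U, solve the ridge problem for every item vector v_j at once.
        V = np.linalg.solve(U.T @ U + lam_v * I, U.T @ R).T
    return U, V

R = np.array([[1., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
U, V = als(R)
print(U @ V.T)  # approximate reconstruction of R
```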
A commonly used example to explain matrix factorization
[Figure: users and items placed in a two-dimensional latent space whose dimensions correspond to "machine learning" and "information retrieval".]
Understanding the latent space is helpful
Provide new ways of exploration — not just a list of articles (items).
Increase users’ confidence in the system — more likely to accept recommendations.
Provide new ways of diagnosing the system — not just accuracy, etc.
Unfortunately, matrix factorization does not provide this in an easy way.
A good article recommender system
It should be able to
1 recommend old articles (easy),
2 recommend new articles (not that easy, but doable),
3 provide new ways of exploration — not just a list of items (challenging).
Our goal is not only to improve the recommendation performance, but also the exploration experience.
We use the interpretability of topic modeling to improve exploration.
Topic modeling
[Figure: topics (distributions over words, e.g. "gene, dna, genetic", "life, evolve, organism", "brain, neuron, nerve", "data, number, computer"), a document, and its topic proportions and assignments; the topics and assignments shown are illustrative, not fit from real data.]
Each topic is a distribution over words.
Each document is a mixture of corpus-wide topics.
Each word is drawn from one of those topics.
For each document, the words are generated in a two-stage process:
1 Randomly choose a distribution over topics.
2 For each word in the document,
(a) randomly choose a topic from the distribution over topics in step 1;
(b) randomly choose a word from the corresponding distribution over the vocabulary.
Adapted from Introduction to Probabilistic Topic Models, D. Blei 2011.
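A small numpy sketch of this two-stage process, with two hand-made toy topics (the words and probabilities are illustrative, not inferred from data):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["gene", "dna", "life", "evolve", "data", "computer"]
# Two toy topics (distributions over the vocabulary) -- illustrative only.
topics = np.array([
    [0.4, 0.4, 0.1, 0.05, 0.03, 0.02],   # a "genetics" topic
    [0.02, 0.03, 0.05, 0.1, 0.4, 0.4],   # a "data analysis" topic
])
alpha = np.array([0.5, 0.5])             # Dirichlet hyperparameter

# Step 1: choose this document's distribution over topics.
theta = rng.dirichlet(alpha)
# Step 2: for each word, choose a topic, then a word from that topic.
doc = []
for _ in range(10):
    z = rng.choice(len(topics), p=theta)
    w = rng.choice(len(vocab), p=topics[z])
    doc.append(vocab[w])
print(theta, doc)
```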
Latent Dirichlet allocation
Latent Dirichlet allocation (LDA) is a popular topic model. It assumes:
There are K topics.
For each article, topic proportions $\theta \sim \text{Dirichlet}(\alpha)$.
[Figure: the topics and an article's topic proportions $\theta_j$.]
Note that θ can explain the topics that an article talks about!
LDA as a graphical model
[Figure: the LDA graphical model — Dirichlet parameter $\alpha$, per-document topic proportions $\theta_d$, per-word topic assignment $z_{d,n}$, observed word $w_{d,n}$, and topics $\beta_k$ with topic hyperparameter $\eta$; plates over the $N$ words, $D$ documents, and $K$ topics.]
Vertices denote random variables.
Edges denote dependence between random variables.
Shading denotes observed variables.
Plates denote replicated variables.
Adapted from David Blei’s slides.
Running a topic model
Data: article titles + abstracts from CiteULike.
16,980 articles
1.6M words
8K unique terms
Model: 200-topic LDA model with variational inference.
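For illustration, an off-the-shelf variational implementation such as scikit-learn's LatentDirichletAllocation can fit a model of this shape; this is an assumed stand-in, not the talk's actual code (the talk fits K = 200 topics on 16,980 articles, while the toy corpus below only supports a few):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for the CiteULike titles+abstracts.
docs = ["gene dna genetic expression tissue",
        "wireless sensor network routing protocol",
        "learning machine training kernel classifier"]
X = CountVectorizer().fit_transform(docs)          # word counts
lda = LatentDirichletAllocation(n_components=3,    # K = 200 in the talk
                                learning_method="batch", random_state=0)
theta = lda.fit_transform(X)                       # per-article topic proportions
print(theta.round(2))
```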
Inferred topics for the whole corpus
gene, genes, expression, tissues, regulation, coexpression, tissuespecific, expressed, tissue, regulatory
nodes, wireless, protocol, routing, protocols, node, sensor, peertopeer, scalable, hoc
distribution, random, probability, distributions, sampling, stochastic, markov, density, estimation, statistics
learning, machine, training, vector, learn, machines, kernel, learned, classifiers, classifier
relative, importance, give, original, respect, obtain, ranking, metric, weighted, compute
Inferred topic proportions for article I
[Figure: an article's topic proportions; the top topics are "relative, importance, give, original, respect, obtain, ranking", "large, small, numbers, larger, extremely, amounts, smaller", and "web, semantic, pages, page, metadata, standards, rdf, xml".]
Inferred topic proportions for article II
Maximum Likelihood from Incomplete Data via the EM Algorithm, by A. P. Dempster, N. M. Laird and D. B. Rubin, 1977.
[Figure: the article's first page and its topic proportions; the top topics are "estimate, estimates, likelihood, maximum, estimated, missing", "distribution, random, probability, distributions, sampling, stochastic", and "algorithm, signal, input, signals, output, exact, performs, music".]
Topic modeling — recap
[Figure: the inferred corpus topics and the EM-algorithm article with its topic proportions from the previous slides — topic modeling represents an article through interpretable topics.]
Comparison of the article representations
[Figure: the EM-algorithm article. Topic modeling represents it with interpretable topics ("estimate, estimates, likelihood, maximum, ...", "distribution, random, probability, ...", "algorithm, signal, input, ..."); matrix factorization represents it only with question marks — an uninterpretable latent vector.]
Collaborative topic regression: motivations — cont’d
Two topics: machine learning, social network analysis.
Two articles: article A, article B.
Two people: a boy and a girl.
Preferences:
The boy likes A and B — no problem.
The girl likes A and B — problem?
The basic idea
Notation:
Item latent vector $v$: how users view an article.
Topic proportions $\theta$: the article content.
Assumption: the item vector $v$ is close to $\theta$, but can diverge from it.
The graphical model
Topic proportions $\theta$; item latent vector $v \sim \mathcal{N}(\theta, \lambda_v^{-1} I_K)$.
For each article,
1 Draw topic proportions $\theta \sim \text{Dirichlet}(\alpha)$.
2 Draw the item vector $v \sim \mathcal{N}(\theta, \lambda_v^{-1} I_K)$. Or draw a latent offset $\varepsilon \sim \mathcal{N}(0, \lambda_v^{-1} I_K)$ and set the item latent vector as $v = \theta + \varepsilon$.
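A minimal sketch of this generative step (the values of K, α, and λv below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 4                 # number of topics (toy value)
alpha = np.ones(K)    # Dirichlet hyperparameter (illustrative)
lambda_v = 100.0      # precision: larger keeps v closer to theta

# Draw an article's topic proportions, then its item latent vector
# as theta plus a Gaussian offset epsilon.
theta = rng.dirichlet(alpha)
epsilon = rng.normal(0.0, np.sqrt(1.0 / lambda_v), size=K)
v = theta + epsilon
print(theta, v)  # v stays near theta when lambda_v is large
```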
Infer the hidden variables
We develop an EM-style algorithm to find the maximum a posteriori (MAP) estimates:
$$u_i \leftarrow (V C_i V^T + \lambda_u I_K)^{-1} V C_i R_i$$
$$v_j \leftarrow (U C_j U^T + \lambda_v I_K)^{-1} (U C_j R_j + \lambda_v \theta_j)$$
The user latent vector update is the same as in matrix factorization. In the item update, the first term brings in the user rating information and $\lambda_v$ sets its relative "weight"; if $U = 0$ (no user ratings), $v_j = \theta_j$.
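A direct transcription of these updates in numpy (a sketch, not the authors' code; here C stores the per-pair precisions c_ij, R the ratings, theta the per-item topic proportions, and the hyperparameter values are illustrative):

```python
import numpy as np

def ctr_updates(R, C, theta, lam_u=0.01, lam_v=100.0, n_iters=20):
    """Coordinate-ascent MAP updates from the slide (a sketch).
    R: ratings (n_users x n_items); C: precisions c_ij (same shape);
    theta: topic proportions (n_items x K)."""
    n_users, n_items = R.shape
    K = theta.shape[1]
    U = np.zeros((n_users, K))
    V = theta.copy()                      # with no ratings, v_j = theta_j
    I = np.eye(K)
    for _ in range(n_iters):
        for i in range(n_users):
            Ci = np.diag(C[i])
            U[i] = np.linalg.solve(V.T @ Ci @ V + lam_u * I,
                                   V.T @ Ci @ R[i])
        for j in range(n_items):
            Cj = np.diag(C[:, j])
            V[j] = np.linalg.solve(U.T @ Cj @ U + lam_v * I,
                                   U.T @ Cj @ R[:, j] + lam_v * theta[j])
    return U, V

rng = np.random.default_rng(0)
R = np.array([[1., 1., 0.], [0., 1., 1.]])
C = np.where(R > 0, 1.0, 0.01)             # higher confidence on observed likes
theta = rng.dirichlet(np.ones(2), size=3)  # 3 items, K = 2 topics
U, V = ctr_updates(R, C, theta)
print(U @ V.T)                             # predicted ratings
```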
Make predictions
We consider two scenarios:
In-matrix prediction: items have been rated before.
Out-of-matrix prediction: items have never been rated.
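In code, the two scenarios differ only in which item vector is used: the learned v_j for in-matrix items, and the topic proportions θ_j for out-of-matrix items (since v_j = θ_j when an item has no ratings). A sketch:

```python
import numpy as np

def predict_in_matrix(u_i, v_j):
    # Item j has been rated before: use its learned latent vector v_j.
    return u_i @ v_j

def predict_out_of_matrix(u_i, theta_j):
    # Item j has never been rated: v_j reduces to the topic
    # proportions theta_j, so content alone drives the score.
    return u_i @ theta_j

u = np.array([0.2, 0.8])  # toy user vector
print(predict_in_matrix(u, np.array([0.5, 0.5])),
      predict_out_of_matrix(u, np.array([0.9, 0.1])))
```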
Experimental settings
1 Two datasets:
CiteULike: 5,551 users, 16,980 articles, and 204,986 bibliography entries (sparsity = 99.8%).
Mendeley: 80,278 users, 261,248 articles, and 5,120,944 bibliography entries (sparsity = 99.98%).
2 Comparison: matrix factorization (MF) and a text-based method (LDA).
Results
1 In-matrix prediction: CTR improves more as the number of recommendations gets larger.
2 Out-of-matrix prediction: about the same as LDA.
[Figure: recall vs. number of recommended articles (50 to 200) on CiteULike, in-matrix and out-of-matrix panels, comparing CTR, LDA, and MF.]
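The recall here is presumably recall@M as in the paper: the fraction of a user's held-out liked articles that appear in the top M recommendations. A sketch of that metric:

```python
import numpy as np

def recall_at_m(scores, held_out, m):
    """Fraction of a user's held-out articles found in the top-m list.
    scores: predicted score per article; held_out: set of article ids."""
    top_m = np.argsort(-scores)[:m]
    hits = len(set(top_m.tolist()) & held_out)
    return hits / len(held_out)

scores = np.array([0.9, 0.1, 0.7, 0.3])  # toy predicted ratings u_i^T v_j
print(recall_at_m(scores, held_out={0, 2, 3}, m=2))  # 2 of 3 -> 0.667
```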
Results
[Figure: recall vs. number of recommended articles (50 to 150) on Mendeley, in-matrix and out-of-matrix panels, comparing CTR, LDA, and MF.]
When precision parameter λv varies
Recall that $\lambda_v$ penalizes how far $v$ can diverge from $\theta$:
1 When $\lambda_v$ is small, CTR behaves more like MF.
2 As $\lambda_v$ increases, CTR brings in both ratings and content.
3 When $\lambda_v$ is large, CTR behaves more like LDA.
[Figure: recall@100 vs. $\lambda_v$ (10 to 10000) on CiteULike, in-matrix and out-of-matrix panels, for CTR, LDA, and MF.]
When precision parameter λv varies
[Figure: recall@100 vs. $\lambda_v$ (1000 to 5000) on Mendeley, in-matrix and out-of-matrix panels, for CTR, LDA, and MF.]
Example user profile, CiteULike
Top topics:
1. image, measure, measures, images, motion, matching
2. learning, machine, training, vector, learn, machines
3. sets, objects, defined, categories, representations
Top articles:
1. Information theory inference learning algorithms
2. Machine learning in automated text categorization
3. Artificial intelligence a modern approach
4. Data mining: practical machine learning tools . . .
5. Statistical learning theory
6. Modern information retrieval
7. Pattern recognition and machine learning
8. Recognition by components: a theory of human . . .
9. Data clustering a review
10. Indexing by latent semantic analysis
Example user profile, Mendeley
Top topics:
1. auditory, speech, sound, music, perception, acoustic, sounds
2. noise, threshold, stochastic, fluctuations, thresholds, model
3. words, word, semantic, reading, priming, processing
Top articles:
1. Music perception with temporal cues in acoustic and electric . . .
2. The role of melodic and temporal cues in perceiving . . .
3. The Role of Spectral and Temporal Cues in Voice . . .
4. Musicians have enhanced subcortical auditory and . . .
5. Speech versus nonspeech in pitch memory
6. Auditory organization of sound sequences by . . .
7. Selective subcortical enhancement of musical intervals . . .
8. The neuronal representation of pitch in primate . . .
9. An auditory region in the primate insular cortex . . .
10. Analysis of the spectral envelope of sounds . . .
Example article profile I, CiteULike
Maximum likelihood from incomplete data via the EM algorithm, Dempster et al. 1977.
[Figure: topic weights over the 200 topics; the top topics are "estimate, estimates, likelihood, maximum, estimated, missing", "algorithm, signal, input, signals, output, exact", "models, probabilistic, graphical, modelling, generalized, incorporate", and "parameters, bayesian, inference, optimal, procedure, prior".]
The same article from Mendeley
Maximum likelihood from incomplete data via the EM algorithm, Dempster et al. 1977.
[Figure: topic weights; the top topics are "estimates, estimate, estimation, estimated, estimating, likelihood", "algorithm, algorithms, optimization, problem, efficient, problems", "image, images, segmentation, algorithm, registration, camera", and "bayesian, model, inference, models, probability, probabilistic".]
Example article profile II, CiteULike
Article: Phase-of-firing coding of natural visual stimuli in primary visual cortex.
[Figure: topic weights over the 200 topics, with a large weight on the topic "models, probabilistic, graphical, modelling, generalized, incorporate".]
Flexible recommendation design
Adaptive design I:
[Figure: recommended articles grouped by the user's topics; when a new topic is added, the recommendations adapt.]
See the full demo
http://www.cs.princeton.edu/~chongw/citeulike/
The demo
The entry point of the demo gives three links:
Users,
Topics,
Articles (ranked by offset and frequency).
User list page
Topic list page
These topics give an overview of what this entire collection is about.
Article list page — ranked by the offset
These articles are sorted by their offset — how far the users’ view diverges from the word content.
User can browse his/her interests
A user’s interests are summarized using the top topics he/she is interested in, as we saw in the previous slides.
http://www.cs.princeton.edu/~chongw/citeulike/users/user2832.html
User can read the recommendations
When a user clicks on one recommendation — the article itself
When a user clicks on one recommendation — the topics
How the word content differs from the people’s view.
[Figure: topic proportions "in the article" vs. "what people see".]
When a user clicks on one topic — related users
This gives the top users who like this topic.
When a user clicks on one topic — related documents
Related documents based on word content versus based on people’s view.
Future work
We would like to work on the following directions:
incorporating other ways of capturing the popularity of articles, such as metadata (e.g., authors);
modeling user and item profiles over time;
extending to other domains, such as movie and product recommendations.
The end
Thanks a lot!