Recommendation with a Reason: Collaborative Topic Modeling for Recommending Scientific Articles
Chong Wang
Joint work with David M. Blei
Computer Science Department, Princeton University
SAMSI Computational Advertising Workshop
Recommendation is common
Type “laptop” in Amazon...
What would you do?
Read every description yourself.
Ask a good friend to do it.
Ask many “friends” to do it.
What do they say?
Sorted by Avg. Customer Review
So, where are those recommender systems?
and more ...
But I also do research in machine learning...
Today’s focus — finding scientific articles
A search for “machine learning” on Google gives 2,520,000 results.
If I have already read articles A, B, and C, what should I read next?
Finding relevant articles
1 An important task for researchers:
learn about the general ideas,
learn about the state of the art.
2 Some existing approaches:
following article references: easy to miss relevant citations;
using keyword search: difficult to form queries, only good for directed exploration.
3 We wish to improve the user experience in both recommendation and exploration.
CiteULike
Mendeley
Example user library, from CiteULike
1 Probabilistic latent semantic indexing
2 Optimizing search engines using clickthrough data
3 A re-examination of text categorization methods
4 On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
5 A Comparative Study on Feature Selection in Text Categorization
6 ...
What should we recommend to this user?
Demo
http://www.cs.princeton.edu/~chongw/citeulike/
User can browse his/her interests
A user’s interests are summarized using the top topics he/she is interested in.
http://www.cs.princeton.edu/~chongw/citeulike/users/user2832.html
User can read the recommendations
User can also explore recommendations
See how other people read this article.
[Figure: topic proportions "in the article" vs. "what people see".]
Outline
1 Matrix factorization for recommendation
2 Topic modeling
3 Collaborative topic models
4 Empirical Results
Matrix factorization for recommendation
Three important elements:
users: you and me
items: articles
ratings: a user likes/dislikes some of the articles
Popular solution: collaborative filtering (CF)
matrix factorization: one of the most popular algorithms for recommender systems.
The user-item matrix (rows are users 1-3, columns are items 1-3; ✔ = like, ✖ = dislike, ? = unknown):
$$\begin{bmatrix} 1 & 0 & ? \\ 1 & ? & 0 \\ ? & 1 & 0 \end{bmatrix}$$
Matrix factorization
Users and items are represented in a shared but unknown latent space:
user $i$ — $u_i \in \mathbb{R}^K$,
item $j$ — $v_j \in \mathbb{R}^K$.
Each dimension of the latent space is assumed to represent some kind of unknown common interest.
The rating of item $j$ by user $i$ is given by the dot product
$$r_{ij} = u_i^T v_j,$$
where $r_{ij} = 1$ indicates like and $0$ dislike.
Probabilistic formulation of matrix factorization (PMF)
1 For each user $i$, draw the user latent vector $u_i \sim \mathcal{N}(0, \lambda_u^{-1} I_K)$.
2 For each item $j$, draw the item latent vector $v_j \sim \mathcal{N}(0, \lambda_v^{-1} I_K)$.
3 For each user-item pair $(i, j)$, draw the rating $r_{ij} \sim \mathcal{N}(u_i^T v_j, c_{ij}^{-1})$, where $c_{ij}$ is the precision parameter.
The expectation of the rating is
$$\mathbb{E}[r_{ij}] = u_i^T v_j.$$
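A minimal numpy sketch of this generative process (the sizes, hyperparameters, and constant precision below are illustrative choices, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, K = 3, 3, 2            # toy sizes (assumed)
lambda_u, lambda_v, c = 0.1, 0.1, 10.0   # illustrative hyperparameters

# Steps 1-2: draw user and item latent vectors from zero-mean Gaussian priors.
U = rng.normal(0.0, np.sqrt(1.0 / lambda_u), size=(n_users, K))
V = rng.normal(0.0, np.sqrt(1.0 / lambda_v), size=(n_items, K))

# Step 3: draw each rating around u_i^T v_j with precision c_ij (constant here).
R = rng.normal(U @ V.T, np.sqrt(1.0 / c))
print(R)  # a noisy version of U @ V.T, since E[r_ij] = u_i^T v_j
```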
Learning and Prediction
Learning the latent vectors for users and items:
$$\min_{U,V} \sum_{i,j} (r_{ij} - u_i^T v_j)^2 + \lambda_u \|u_i\|^2 + \lambda_v \|v_j\|^2,$$
where $\lambda_u$ and $\lambda_v$ are regularization parameters.
Prediction for user $i$ on item $j$ (not rated by user $i$ before):
$$r_{ij} \approx u_i^T v_j.$$
How do we understand these latent vectors for users and items?
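One standard way to optimize this objective is alternating least squares: fix V and solve a ridge regression for each u_i, then fix U and solve for each v_j. A sketch, assuming for simplicity that every entry of R is observed:

```python
import numpy as np

def als(R, K=2, lam_u=0.1, lam_v=0.1, n_iters=50, seed=0):
    """Alternating least squares for min sum (r_ij - u_i^T v_j)^2 + reg."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(size=(n_users, K))
    V = rng.normal(size=(n_items, K))
    I = np.eye(K)
    for _ in range(n_iters):
        # Fix V, solve the ridge problem for every user vector u_i at once.
        U = np.linalg.solve(V.T @ V + lam_u * I, V.T @ R.T).T
        # Fix U, solve the ridge problem for every item vector v_j at once.
        V = np.linalg.solve(U.T @ U + lam_v * I, U.T @ R).T
    return U, V

R = np.array([[1., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
U, V = als(R)
print(U @ V.T)  # approximate reconstruction of R
```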
A commonly used example to explain matrix factorization
[Figure: users and items placed in a two-dimensional latent space whose dimensions correspond to "machine learning" and "information retrieval".]
Understanding the latent space is helpful
Provide new ways of exploration — not just a list of articles (items).
Increase users’ confidence in the system — more likely to accept recommendations.
Provide new ways of diagnosing the system — not just accuracy, etc.
Unfortunately, matrix factorization does not provide this in an easy way.
A good article recommender system
It should be able to
1 recommend old articles (easy),
2 recommend new articles (not that easy, but doable),
3 provide new ways of exploration — not just a list of items (challenging).
Our goal is not only to improve the recommendation performance, but also the exploration experience.
We use the interpretability of topic modeling to improve exploration.
Topic modeling
[Figure: topics (distributions over words, e.g. "gene, dna, genetic", "life, evolve, organism", "brain, neuron, nerve", "data, number, computer"), a document, and its topic proportions and assignments; the topics and assignments shown are illustrative, not fit from real data.]
Each topic is a distribution over words.
Each document is a mixture of corpus-wide topics.
Each word is drawn from one of those topics.
For each document, the words are generated in a two-stage process:
1 Randomly choose a distribution over topics.
2 For each word in the document,
(a) randomly choose a topic from the distribution over topics in step 1;
(b) randomly choose a word from the corresponding distribution over the vocabulary.
Adapted from Introduction to Probabilistic Topic Models, D. Blei 2011.
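A small numpy sketch of this two-stage process, with two hand-made toy topics (the words and probabilities are illustrative, not inferred from data):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["gene", "dna", "life", "evolve", "data", "computer"]
# Two toy topics (distributions over the vocabulary) -- illustrative only.
topics = np.array([
    [0.4, 0.4, 0.1, 0.05, 0.03, 0.02],   # a "genetics" topic
    [0.02, 0.03, 0.05, 0.1, 0.4, 0.4],   # a "data analysis" topic
])
alpha = np.array([0.5, 0.5])             # Dirichlet hyperparameter

# Step 1: choose this document's distribution over topics.
theta = rng.dirichlet(alpha)
# Step 2: for each word, choose a topic, then a word from that topic.
doc = []
for _ in range(10):
    z = rng.choice(len(topics), p=theta)
    w = rng.choice(len(vocab), p=topics[z])
    doc.append(vocab[w])
print(theta, doc)
```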
Latent Dirichlet allocation
Latent Dirichlet allocation (LDA) is a popular topic model. It assumes:
There are K topics.
For each article, topic proportions $\theta \sim \text{Dirichlet}(\alpha)$.
[Figure: the topics and an article's topic proportions $\theta_j$.]
Note that θ can explain the topics that an article talks about!
LDA as a graphical model
[Figure: the LDA graphical model — Dirichlet parameter $\alpha$, per-document topic proportions $\theta_d$, per-word topic assignment $z_{d,n}$, observed word $w_{d,n}$, and topics $\beta_k$ with topic hyperparameter $\eta$; plates over the $N$ words, $D$ documents, and $K$ topics.]
Vertices denote random variables.
Edges denote dependence between random variables.
Shading denotes observed variables.
Plates denote replicated variables.
Adapted from David Blei’s slides.
Running a topic model
Data: article titles + abstracts from CiteULike.
16,980 articles
1.6M words
8K unique terms
Model: 200-topic LDA model with variational inference.
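For illustration, an off-the-shelf variational implementation such as scikit-learn's LatentDirichletAllocation can fit a model of this shape; this is an assumed stand-in, not the talk's actual code (the talk fits K = 200 topics on 16,980 articles, while the toy corpus below only supports a few):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for the CiteULike titles+abstracts.
docs = ["gene dna genetic expression tissue",
        "wireless sensor network routing protocol",
        "learning machine training kernel classifier"]
X = CountVectorizer().fit_transform(docs)          # word counts
lda = LatentDirichletAllocation(n_components=3,    # K = 200 in the talk
                                learning_method="batch", random_state=0)
theta = lda.fit_transform(X)                       # per-article topic proportions
print(theta.round(2))
```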
Inferred topics for the whole corpus
gene, genes, expression, tissues, regulation, coexpression, tissuespecific, expressed, tissue, regulatory
nodes, wireless, protocol, routing, protocols, node, sensor, peertopeer, scalable, hoc
distribution, random, probability, distributions, sampling, stochastic, markov, density, estimation, statistics
learning, machine, training, vector, learn, machines, kernel, learned, classifiers, classifier
relative, importance, give, original, respect, obtain, ranking, metric, weighted, compute
Inferred topic proportions for article I
[Figure: an article's topic proportions; the top topics are "relative, importance, give, original, respect, obtain, ranking", "large, small, numbers, larger, extremely, amounts, smaller", and "web, semantic, pages, page, metadata, standards, rdf, xml".]
Inferred topic proportions for article II
Maximum Likelihood from Incomplete Data via the EM Algorithm, by A. P. Dempster, N. M. Laird and D. B. Rubin, 1977.
[Figure: the article's first page and its topic proportions; the top topics are "estimate, estimates, likelihood, maximum, estimated, missing", "distribution, random, probability, distributions, sampling, stochastic", and "algorithm, signal, input, signals, output, exact, performs, music".]
Topic modeling — recap
[Figure: the inferred corpus topics and the EM-algorithm article with its topic proportions from the previous slides — topic modeling represents an article through interpretable topics.]
Comparison of the article representations
[Figure: the EM-algorithm article. Topic modeling represents it with interpretable topics ("estimate, estimates, likelihood, maximum, ...", "distribution, random, probability, ...", "algorithm, signal, input, ..."); matrix factorization represents it only with question marks — an uninterpretable latent vector.]
Collaborative topic regression: motivations — cont’d
Two topics: machine learning, social network analysis.
Two articles: article A, article B.
Two people: a boy and a girl.
Preferences:
The boy likes A and B — no problem.
The girl likes A and B — problem?
The basic idea
Notation:
Item latent vector $v$: how users view an article.
Topic proportions $\theta$: the article content.
Assumption: the item vector $v$ is close to $\theta$, but can diverge from it.
The graphical model
Topic proportions $\theta$; item latent vector $v \sim \mathcal{N}(\theta, \lambda_v^{-1} I_K)$.
For each article,
1 Draw topic proportions $\theta \sim \text{Dirichlet}(\alpha)$.
2 Draw the item vector $v \sim \mathcal{N}(\theta, \lambda_v^{-1} I_K)$. Or draw a latent offset $\varepsilon \sim \mathcal{N}(0, \lambda_v^{-1} I_K)$ and set the item latent vector as $v = \theta + \varepsilon$.
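A minimal sketch of this generative step (the values of K, α, and λv below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 4                 # number of topics (toy value)
alpha = np.ones(K)    # Dirichlet hyperparameter (illustrative)
lambda_v = 100.0      # precision: larger keeps v closer to theta

# Draw an article's topic proportions, then its item latent vector
# as theta plus a Gaussian offset epsilon.
theta = rng.dirichlet(alpha)
epsilon = rng.normal(0.0, np.sqrt(1.0 / lambda_v), size=K)
v = theta + epsilon
print(theta, v)  # v stays near theta when lambda_v is large
```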
Infer the hidden variables
We develop an EM-style algorithm to find the maximum a posteriori (MAP) estimates:
$$u_i \leftarrow (V C_i V^T + \lambda_u I_K)^{-1} V C_i R_i$$
$$v_j \leftarrow (U C_j U^T + \lambda_v I_K)^{-1} (U C_j R_j + \lambda_v \theta_j)$$
The user latent vector update is the same as in matrix factorization. In the item update, the first term brings in the user rating information and $\lambda_v$ sets its relative "weight"; if $U = 0$ (no user ratings), $v_j = \theta_j$.
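A direct transcription of these updates in numpy (a sketch, not the authors' code; here C stores the per-pair precisions c_ij, R the ratings, theta the per-item topic proportions, and the hyperparameter values are illustrative):

```python
import numpy as np

def ctr_updates(R, C, theta, lam_u=0.01, lam_v=100.0, n_iters=20):
    """Coordinate-ascent MAP updates from the slide (a sketch).
    R: ratings (n_users x n_items); C: precisions c_ij (same shape);
    theta: topic proportions (n_items x K)."""
    n_users, n_items = R.shape
    K = theta.shape[1]
    U = np.zeros((n_users, K))
    V = theta.copy()                      # with no ratings, v_j = theta_j
    I = np.eye(K)
    for _ in range(n_iters):
        for i in range(n_users):
            Ci = np.diag(C[i])
            U[i] = np.linalg.solve(V.T @ Ci @ V + lam_u * I,
                                   V.T @ Ci @ R[i])
        for j in range(n_items):
            Cj = np.diag(C[:, j])
            V[j] = np.linalg.solve(U.T @ Cj @ U + lam_v * I,
                                   U.T @ Cj @ R[:, j] + lam_v * theta[j])
    return U, V

rng = np.random.default_rng(0)
R = np.array([[1., 1., 0.], [0., 1., 1.]])
C = np.where(R > 0, 1.0, 0.01)             # higher confidence on observed likes
theta = rng.dirichlet(np.ones(2), size=3)  # 3 items, K = 2 topics
U, V = ctr_updates(R, C, theta)
print(U @ V.T)                             # predicted ratings
```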
Make predictions
We consider two scenarios:
In-matrix prediction: items have been rated before.
Out-of-matrix prediction: items have never been rated.
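In code, the two scenarios differ only in which item vector is used: the learned v_j for in-matrix items, and the topic proportions θ_j for out-of-matrix items (since v_j = θ_j when an item has no ratings). A sketch:

```python
import numpy as np

def predict_in_matrix(u_i, v_j):
    # Item j has been rated before: use its learned latent vector v_j.
    return u_i @ v_j

def predict_out_of_matrix(u_i, theta_j):
    # Item j has never been rated: v_j reduces to the topic
    # proportions theta_j, so content alone drives the score.
    return u_i @ theta_j

u = np.array([0.2, 0.8])  # toy user vector
print(predict_in_matrix(u, np.array([0.5, 0.5])),
      predict_out_of_matrix(u, np.array([0.9, 0.1])))
```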
Experimental settings
1 Two datasets:
CiteULike: 5,551 users, 16,980 articles, and 204,986 bibliography entries (sparsity = 99.8%).
Mendeley: 80,278 users, 261,248 articles, and 5,120,944 bibliography entries (sparsity = 99.98%).
2 Comparison: matrix factorization (MF) and a text-based method (LDA).
Results
1 In-matrix prediction: CTR improves more as the number of recommendations gets larger.
2 Out-of-matrix prediction: about the same as LDA.
[Figure: recall vs. number of recommended articles (50 to 200) on CiteULike, in-matrix and out-of-matrix panels, comparing CTR, LDA, and MF.]
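The recall here is presumably recall@M as in the paper: the fraction of a user's held-out liked articles that appear in the top M recommendations. A sketch of that metric:

```python
import numpy as np

def recall_at_m(scores, held_out, m):
    """Fraction of a user's held-out articles found in the top-m list.
    scores: predicted score per article; held_out: set of article ids."""
    top_m = np.argsort(-scores)[:m]
    hits = len(set(top_m.tolist()) & held_out)
    return hits / len(held_out)

scores = np.array([0.9, 0.1, 0.7, 0.3])  # toy predicted ratings u_i^T v_j
print(recall_at_m(scores, held_out={0, 2, 3}, m=2))  # 2 of 3 -> 0.667
```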
Results
[Figure: recall vs. number of recommended articles (50 to 150) on Mendeley, in-matrix and out-of-matrix panels, comparing CTR, LDA, and MF.]
When precision parameter λv varies
Recall that $\lambda_v$ penalizes how far $v$ can diverge from $\theta$:
1 When $\lambda_v$ is small, CTR behaves more like MF.
2 As $\lambda_v$ increases, CTR brings in both ratings and content.
3 When $\lambda_v$ is large, CTR behaves more like LDA.
[Figure: recall@100 vs. $\lambda_v$ (10 to 10000) on CiteULike, in-matrix and out-of-matrix panels, for CTR, LDA, and MF.]
When precision parameter λv varies
[Figure: recall@100 vs. $\lambda_v$ (1000 to 5000) on Mendeley, in-matrix and out-of-matrix panels, for CTR, LDA, and MF.]
Example user profile, CiteULike
Top topics:
1. image, measure, measures, images, motion, matching
2. learning, machine, training, vector, learn, machines
3. sets, objects, defined, categories, representations
Top articles:
1. Information theory inference learning algorithms
2. Machine learning in automated text categorization
3. Artificial intelligence a modern approach
4. Data mining: practical machine learning tools . . .
5. Statistical learning theory
6. Modern information retrieval
7. Pattern recognition and machine learning
8. Recognition by components: a theory of human . . .
9. Data clustering a review
10. Indexing by latent semantic analysis
Example user profile, Mendeley
Top topics:
1. auditory, speech, sound, music, perception, acoustic, sounds
2. noise, threshold, stochastic, fluctuations, thresholds, model
3. words, word, semantic, reading, priming, processing
Top articles:
1. Music perception with temporal cues in acoustic and electric . . .
2. The role of melodic and temporal cues in perceiving . . .
3. The Role of Spectral and Temporal Cues in Voice . . .
4. Musicians have enhanced subcortical auditory and . . .
5. Speech versus nonspeech in pitch memory
6. Auditory organization of sound sequences by . . .
7. Selective subcortical enhancement of musical intervals . . .
8. The neuronal representation of pitch in primate . . .
9. An auditory region in the primate insular cortex . . .
10. Analysis of the spectral envelope of sounds . . .
Example article profile I, CiteULike
Maximum likelihood from incomplete data via the EM algorithm, Dempster et al. 1977.
[Figure: topic weights over the 200 topics; the top topics are "estimate, estimates, likelihood, maximum, estimated, missing", "algorithm, signal, input, signals, output, exact", "models, probabilistic, graphical, modelling, generalized, incorporate", and "parameters, bayesian, inference, optimal, procedure, prior".]
The same article from Mendeley
Maximum likelihood from incomplete data via the EM algorithm, Dempster et al. 1977.
[Figure: topic weights; the top topics are "estimates, estimate, estimation, estimated, estimating, likelihood", "algorithm, algorithms, optimization, problem, efficient, problems", "image, images, segmentation, algorithm, registration, camera", and "bayesian, model, inference, models, probability, probabilistic".]
Example article profile II, CiteULike
Article: Phase-of-firing coding of natural visual stimuli in primary visual cortex.
[Figure: topic weights over the 200 topics, with a large weight on the topic "models, probabilistic, graphical, modelling, generalized, incorporate".]
Flexible recommendation design
Adaptive design I:
[Figure: recommended articles grouped by the user's topics; when a new topic is added, the recommendations adapt.]
See the full demo
http://www.cs.princeton.edu/~chongw/citeulike/
The demo
The entry point of the demo gives three links:
Users,
Topics,
Articles (ranked by offset and frequency).
User list page
Topic list page
These topics give an overview of what this entire collection is about.
Article list page — ranked by the offset
These articles are sorted by their offset — how far the users’ view diverges from the word content.
User can browse his/her interests
A user’s interests are summarized using the top topics he/she is interested in, as we saw in the previous slides.
http://www.cs.princeton.edu/~chongw/citeulike/users/user2832.html
User can read the recommendations
When a user clicks on one recommendation — the article itself
When a user clicks on one recommendation — the topics
How the word content differs from the people’s view.
[Figure: topic proportions "in the article" vs. "what people see".]
When a user clicks on one topic — related users
This gives the top users who like this topic.
When a user clicks on one topic — related documents
Related documents based on word content versus based on people’s view.
Future work
We would like to work on the following directions:
incorporating other ways of capturing the popularity of articles, such as metadata (e.g., authors);
modeling user and item profiles over time;
extending to other domains, such as movie and product recommendations.
The end
Thanks a lot!