
Probabilistic Topic Model for Hybrid Recommender

Systems: A Stochastic Variational Bayesian Approach

Asim Ansari    Yang Li    Jonathan Z. Zhang¹

July 2016

¹Asim Ansari is the William T. Dillard Professor of Marketing at Columbia Business School, Yang Li is Assistant Professor of Marketing at Cheung Kong Graduate School of Business in Beijing, and Jonathan Z. Zhang is Assistant Professor of Marketing at University of Washington in Seattle.

Abstract

Recommendation systems are becoming increasingly popular and useful in e-commerce and digital product contexts that involve heterogeneous consumers and a vast collection of products. This creates representational challenges as features that adequately describe the products are often not readily available. User-generated content in the form of product reviews or product tags can be leveraged to obtain rich representations for subsequent product recommendation and targeting. In this paper we develop a novel covariate-guided supervised topic model (CGSTM) that succinctly characterizes products in terms of latent topics and specifies consumer preferences through these semantic features. At the same time, recommendation contexts generate big data problems stemming from data volume, variety and veracity, such as our setting that includes massive textual and numerical data. To overcome the computational challenges due to the combination of big data and a complex model, we develop a novel stochastic variational Bayesian (SVB) framework to achieve fast, scalable and accurate estimation. We use our SVB approach to estimate our model on a dataset of 8.87 million movie ratings from 111,793 customers and 233,268 textual movie tags. The results show our model generates much better predictions than those from a benchmark model, and yields interesting insights about movie preferences. We also illustrate how our model can be used for targeting recommendations to particular users, and how it can support personalized search by determining relevant products and by generating a personalized ranking of these relevant products.

Keywords: Hybrid Recommendation Models, Personalized Search, User-Generated Content, Probabilistic Topic Models, Big Data, Scalable Inference, Stochastic Variational Bayes.

1 Introduction

Over the past decade, e-commerce firms, online retailers, and digital content providers such as

Amazon, Netflix and New York Times have become increasingly reliant on recommender systems

to target products and digital content to users. Recommender systems are particularly useful in

environments that are characterized by a large number of users who face a vast array of products to

choose from. In such contexts, there is considerable heterogeneity in user preferences for product

attributes and the large number of products implies that users are often unaware or uncertain of

products that might appeal to them. Moreover, such environments are often constantly evolving as

new users and new items are added on a regular basis. Firms therefore use recommender systems

to offer personalized suggestions to users. This task hinges upon the recommender system’s ability

to capture the heterogeneous preferences of current users based on their previous product ratings

or product choices and to use such information to predict their liking of new items.

Recommendation systems need to overcome various modeling and computational challenges to

successfully predict preferences and recommend products. Such systems often operate on a sparse

database in which each consumer rates only a few items and each product is rated or chosen by

only a few customers. The paucity of data for most consumers implies that it is critical to borrow

information from other consumers in order to predict the preferences of a given consumer, and

therefore some type of shrinkage mechanism is needed to model customer heterogeneity. The large

number of products within a database also poses challenges in representing these in terms of their

underlying features. In many marketing situations, such feature representations are unavailable,

or are at best partially available, as considerable domain expertise is needed for firms to manually

supply detailed content descriptors for each product. Yet, a rich representation of products in terms

of their attributes is crucial for properly modeling preference heterogeneity. Thus many systems

rely on some sort of automatic feature extraction. In this research, we show how user-generated

content that represents the “voice of the customers” can be leveraged to automatically extract

features that are most predictive of preferences. Finally, recommender systems need to overcome

various cold start problems associated with new users and new items.

Apart from the modeling challenges, typical recommendation contexts generate big data com-

putational challenges stemming from data volume, variety and veracity. While personalization

focuses on a given user or a given product, large data volume that results from a massive user base

and a vast product mix is critical for recommendation success, as it facilitates borrowing of infor-

mation and enriches representation of products. However, this also results in scalability challenges,

particularly so if complex probabilistic representations are needed to fully capture the information

content in the data. Moreover, the application of user-generated content such as online texts and

tags implies a curse of dimensionality, which needs to be tackled via appropriate dimensionality

reduction procedures. Thus, scalable methods that are capable of estimating probabilistic models

involving many latent variables on large datasets of variegated forms are needed.

In this paper, we develop a novel hybrid model-based recommendation framework that ad-

dresses the above representational challenges. In addition, we develop a novel stochastic variational

Bayesian approach for estimation that overcomes the scalability challenges associated with large

volume and high dimensionality. Specifically, we leverage crowd-sourced textual descriptions of

products, such as user-generated tags, to construct probabilistic content descriptors of products.

This alleviates the often onerous requirement for firm-provided product attributes – an issue

that bedevils most content filtering recommendation systems. We construct a supervised proba-

bilistic topic model that transforms the “voice of the customer” about products and automatically

infers latent product features that not only summarize the semantic content within the tags/words,

but also simultaneously predict product ratings. In our model, the topics capture the semantic

structure of the textual representation of the products and allow automatic dimension reduction of

the vast vocabulary underlying the textual descriptions. More importantly, consumer preferences

are specified in terms of these latent topics in our model. As we jointly model both the textual

data as well as the product ratings, these latent topics are inferred to be the most predictive of

user preferences. In addition, we allow firm specified covariates, when available, to guide the al-

location of products to topics. This results in a recommendation system that leverages preference

heterogeneity over rich user-generated content representations in a seamless manner.

Specifically, we develop a covariate-guided supervised topic model to relate the textual descrip-

tion of products to the user ratings. Our model extends the supervised latent Dirichlet allocation

model (SLDA; Blei and McAuliffe 2007) in several directions to capture the unique characteristics

of the recommendation context. Recommendation datasets often have a cross-nested dependency

structure as a given user rates multiple products and each product is rated by multiple users. In


our model each product description (e.g., a document in the topic model) is associated with mul-

tiple product ratings given by many different users. This is distinct from typical supervised topic

models in which each document is rated by a single user – such models are therefore more suitable

for sentiment analysis of reviews, but are not rich enough to represent the preference heterogeneity

that is crucial in making successful recommendations. We also account for preference heterogeneity

over topics and explicitly take into account the cross-nested structure of the data. Finally, we use

firm specified product covariates to guide the allocation of latent topics to products. This allows

us to tackle the cold-start problem associated with new products.

We apply our modeling framework to the context of personalized movie recommendations. We

believe that the experiential nature of movie products is more amenable for using automatic feature

extraction from crowd-sourced customer opinions, especially because standard content descriptors

such as movie genre are not rich enough to flexibly capture the numerous reasons why certain

movies appeal to particular consumers. While we use crowd-sourced tags to generate textual

representations for movies, such representations can also be obtained from other naturally occurring

information sources on the Internet, such as Wikipedia, blogs, tweets, or product reviews.

Our application can also be considered as a quintessential example of big data marketing because

it simultaneously incorporates multiple facets of the 4Vs framework that is used to characterize

big data situations (Sudhir 2016). For instance, our application uses a very large set of users

and products, which results in a large volume of ratings. Also, the model deals with a variety of

data, including unstructured texts and numbers, and we use natural language processing methods

to replace the high-dimensional semantic content of the tags with a small set of latent topics.

Moreover, our application showcases the challenges and opportunities of data veracity, in that data

can be fused together from disparate sources, as the tags and ratings can be gathered from different

sets of customers on various online platforms.

Given the computational demands of our big data setting, we develop a novel stochastic vari-

ational Bayesian (SVB) approach to achieve fast and scalable inference of the proposed model.

Stochastic variational Bayesian methods differ from sampling-based MCMC approaches for sum-

marizing the posterior distribution, and instead use optimization to approximate the posterior.

Thus SVB methods generate estimation results at a fraction of the time needed for traditional

MCMC methods. Our SVB algorithm contains a number of novel computational features, such as


the use of stochastic natural gradient descent and adaptive mini-batch sizes to significantly enhance

computational speed and estimation scalability.

In the context of our movie application, we show that our model generates much better predic-

tions than a benchmark model that only uses manually specified genre covariates. This showcases

the benefits that accrue from rich feature representations derived from UGC. We also use our model

to uncover a number of interesting insights about the determinants of movie preferences and the

semantic structure behind the movie tags. We show how the model output can be used for a number

of tasks that are relevant in the functioning of a recommender system. We show how our model can

be used to generate unconditional recommendations for a given user. More interestingly, we show

how our model can support different types of personalized search within the product recommen-

dation context. For example, we show how the model can generate a personalized ranking of a set

of movies that are most similar to a given movie specified by a user. We also show how our model

can generate a set of personalized ranked results for a user query that specifies the user needs via

a list of keywords.

In summary, our research has both methodological and managerial contributions. Methodolog-

ically, our model extends traditional supervised topic models by incorporating a number of features

that are relevant for the recommendation context. We also develop a novel SVB algorithm for

scalable inference of our model. Our SVB algorithm can be applied to other big data marketing

contexts. For instance, it can be modified to accommodate other supervised mixed membership

models or hierarchical models with both conjugate and non-conjugate components. On the man-

agerial front, our model can be used not only for generating insights about the determinants of

consumer preferences, but also for directly recommending products beyond the movie category.

Given that segmentation, targeting and personalization are core marketing activities, our modeling

and estimation approaches are immediately useful for marketing practitioners.

The rest of the paper proceeds as follows. After a literature review, we describe the different

components of our data in Section 3. We then develop the modeling framework for hybrid rec-

ommendation systems in Section 4. After revealing the computational challenges, we introduce

variational Bayesian methods in Section 5, and we elaborate on the stochastic natural gradient

strategy to speed up the computation for massive data settings. In Section 6 we present the es-

timation results and the associated managerial insights. Finally we conclude by discussing the


limitations of the current model and estimation approaches, and highlight potential directions for

further research.

2 Literature Review

Several research areas in marketing, statistics and machine learning are relevant for our work on

personalized recommendation systems in big data settings. These include the literature on rec-

ommendation systems and the natural language processing literature on probabilistic topic models

and mixed membership models. Moreover, the ongoing research on scalable Bayesian inference in

statistics and computer science is relevant for handling the big data challenges in our application.

We succinctly review these areas below.

A number of studies in the marketing and computer science literatures have developed algo-

rithmic and statistical approaches for generating recommendations. Recommender systems differ

on whether they are model-based or not. Prominent classes of model-based recommender systems

include collaborative filtering, content filtering, and hybrid approaches that use a combination of

collaborative and content filtering.

Collaborative filtering models (see Desrosiers and Karypis 2011 for a review) rely solely on user

ratings or purchase incidence data and leverage the similarity in preferences across users or across

items. These methods therefore do not leverage attribute information about products. In particular,

user-based collaborative filtering identifies those users who are closest to a given user in terms of

their preferences for products in the database. Similarly, item-based collaborative filtering identifies

those products that are closest to a given product in terms of their appeal to customers. More

recent incarnations of collaborative filtering use matrix factorizations (Koren and Bell 2011) of the

user-item ratings matrix to uncover latent factors that represent user preferences or unobservable

product features. These matrix factorization approaches result in automatic summarization and

dimension reduction of the ratings matrix. Despite their advantages, collaborative filtering methods

suffer from the cold-start problem in that they cannot be used for new users or new items.

Content filtering systems (see Lops et al. 2011 for a review), in contrast, use information about

the content of an item to capture the drivers of preferences. Content is broadly defined and often

takes the shape of a set of product features that are used to model the variability in product ratings.


Content-based systems can be useful in providing the underlying rationale for a recommendation,

thereby increasing customer trust in the recommendations. Content-based methods have the additional advantage that they can be used to predict preferences for new items based on their

constituent features. However, explicitly coding a set of features to sufficiently describe an item

can become difficult, especially when dealing with a large number of products, as is typically the

case in the online environment where products are added on a continual basis. Moreover, a com-

plete description of a product could require many attributes, which can add to the difficulty of data

collection considerably, especially if domain experts are needed to specify the relevant attribute

values.

Hybrid recommender systems integrate collaborative and content filtering models to leverage

the best features of both. Ansari, Essegaier and Kohli (2000) develop such a hybrid hierarchical

Bayesian model to leverage the preference heterogeneity across consumers in making recommen-

dations. In this model, the Bayesian shrinkage arising from the population distribution that char-

acterizes preference heterogeneity automatically allows for model-based collaborative filtering. A

number of marketing scholars have made advances in this area, including Ying et al. (2006), Boda-

pati (2008), Chung, Rust and Wedel (2009) and Chung and Rao (2012). In this paper, we continue

in this tradition, but focus explicitly on leveraging automatic content representation obtained via

probabilistic topic models to predict preferences.

The natural language processing literature on probabilistic topic models for textual data (e.g.,

Blei, Ng and Jordan 2003; Blei and McAuliffe 2007) is also relevant for our recommendation model.

Topic models are mixed membership models that automatically summarize the latent topics char-

acterizing the semantic structure underlying a corpus of documents (Tirunillai and Tellis 2014).

While traditional supervised topic models are suitable for aggregate semantic analysis of textual

data such as reviews, these models often limit one document to be rated by a single author. Hence,

they are not readily suitable for personalized recommendation systems where each product may

contain multiple ratings from different users. Our CGSTM model extends supervised topic models

via a richer latent variable specification that captures the dependency structure of recommenda-

tion databases. In particular, it allows for multiple ratings from different users for each document

(movie) and for individual differences across users in their preference structure over the topics.

Finally, the statistical and machine learning literature on scalable Bayesian inference is relevant


given the big data setting of our application. Bayesian methods (Rossi, Allenby and McCulloch

2005) are particularly suited for recommendation problems, given the need to pool information across users in the modeling of heterogeneity, and the need to generate individual-level estimates

of consumer preferences. MCMC methods are popular in summarizing the posterior distribution

of latent variables and parameters, but can be slow in big data contexts due to the need for tens

of thousands of iterations required for convergence (Braun and McAuliffe 2010). We therefore use

variational Bayesian methods (Bishop 2006; Dzyabura and Hauser 2011; Ormerod and Wand 2010)

which replace sampling with optimization, thus resulting in significant speed improvements. In

particular, we leverage the state-of-the-art advances in stochastic variational methods (Hoffman et

al. 2013; Tan 2015; Toulis and Airoldi 2016) to ensure the speed and scalability of inference. We now describe the data context to facilitate an easier understanding of our model.

3 Data Description

We use our model on the MovieLens data (Harper and Konstan 2015) for movie recommendations.

Our analysis is based on a dataset that was made available by MovieLens on August 06, 2015.

The data contains 21,622,187 ratings and 516,139 tag applications across 30,106 movies. There are

234,934 users in the data who provided ratings between January 09, 1995 and August 06, 2015.

The data files contain 1) the movie ratings given by users on a 10-point scale ranging from 0.5 to 5

in 0.5 point increments, 2) textual tags applied to movies by the users, and 3) the title and genre

information for each movie. In the data, users were free to come up with any tags that described

the movies. Not all users in the dataset tagged the movies, so we aggregate all the tags that are

applied to the same movie across users to construct a “bag of tags” description of the movie. Thus,

in using the tags, we ignore the identity of the users who supplied the tag. In addition, the dataset

also describes the movies using a set of 19 genres. Lastly, the dataset does not include any user

demographics.

We randomly select 5,000 movies from the set of 10,722 movies that received tags from at least

four users. We then use a number of preprocessing steps on the textual data associated with these

movies to clean it for analysis. In particular, we convert the tags to lower-case to eliminate any

redundancy in the tags that may arise from lower and upper case versions of the same tag. We

decide against tag stemming to facilitate easy understanding of the topics by readers. We choose not


Figure 1: Proportion of Movie Genres

to tokenize multi-worded tags into space-separated words, as the tag as a whole is more meaningful

than the individual words comprising the tags. To reduce vocabulary size to a manageable level, we

also discard all tags that were applied only once in the data. In addition, as our data contains well

formed tags and not free flowing reviews or conversations, there is no need to remove stop-words,

as is typically done in textual preprocessing. These preprocessing steps result in a sample of 4,609

movies that were rated by 111,793 users. The total number of tag applications across all movies is

233,268 and the overall vocabulary size (i.e., the number of distinct tags) is 21,255. Compared to

the 19 genres, this large vocabulary has the potential to be a lot more expressive about the movie

characteristics perceived by the users. The final dataset contains 8,865,061 ratings across the users

and movies.
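The preprocessing pipeline described above can be summarized in a short sketch. This is a minimal illustration in Python, assuming the tag applications are available as (movie_id, tag) pairs; the function name and data layout are ours, not part of the original data files.

```python
from collections import Counter

def preprocess_tags(tag_applications):
    """tag_applications: list of (movie_id, tag_string) pairs."""
    # Lower-case tags to remove redundancy between case variants of the same tag.
    lowered = [(m, t.lower()) for m, t in tag_applications]
    # No stemming and no tokenization of multi-word tags: each tag is kept whole.
    # Discard tags applied only once in the data to reduce the vocabulary size.
    counts = Counter(t for _, t in lowered)
    kept = [(m, t) for m, t in lowered if counts[t] > 1]
    # Build the "bag of tags" per movie, ignoring the identity of the taggers.
    bags = {}
    for m, t in kept:
        bags.setdefault(m, []).append(t)
    vocab = sorted({t for _, t in kept})
    return bags, vocab
```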

Now we provide some summary statistics on the data. First, the proportions of the 19 movie

genres in our sample are shown in Figure 1. We can see that Drama, Comedy, Action, Thriller and

Romance are the most represented genres in the data, whereas Film-Noir is the least represented.

Figure 2 shows a word cloud that reflects the most frequent tags applied to the movies. It is clear

that many of these popular tags do not overlap with the 19 genres. The diversity of the tags seen


Figure 2: Word Cloud of Movie Tags

Figure 3: Histogram of the Number of Tags Received by Each Movie

in the word cloud highlights the importance of using crowd-sourced tagging to generate content

attributes beyond traditional standard attributes for describing and recommending products.

Figure 3 provides the histogram of the number of tags received by each movie. The median

number of tags for a movie is 16, with a mean of 50 and a standard deviation of 114.9. It is

interesting to note that the median number of tags is lower than 19, the number of pre-specified

genres within the data. The number of tags that were attached to a movie depends upon the movie’s

popularity and upon the time span for which it was part of the recommender system database.



Figure 4: Scatter Plots for: (a) the Means and Standard Deviations of Ratings at Movie Level; (b) the Means and Standard Deviations of Ratings at User Level

As for the ratings, the mean across all observations of our data is 3.57, with a standard deviation

of 1.03, and the median rating is 4. There is considerable heterogeneity in the number of movies

that individuals rated. The median number of movies that a user rated is 43, and the mean is 79.

In addition, we compute the mean and standard deviation of the ratings received by each movie,

and similarly the mean and standard deviation of the ratings supplied by each individual. Figure

4 shows the scatter plots of these statistics. It is clear from the figures that there is considerable

heterogeneity in the ratings at both the movie and the user levels. The negative correlation that is

evident in both plots implies that the movies with lower mean ratings also exhibit higher variability

in their ratings and similarly, users who on average give low ratings also have higher variability in

their ratings. These observations make sense as movies that are not universally loved or hated (e.g.,

cult movies) would appeal differently to different users. Users who tend to be more critical would

analyze the films more carefully, thus giving more nuanced opinions.

4 Model

Our recommendation model applies to a database of standard attributes, textual descriptions and

user ratings for products. We index the products (or documents) by d ∈ {1, . . . , D} and the users

who rate these products by i ∈ {1, . . . , I}. Besides standard product attributes xd, each product

is also described by a textual description, which can be a “bag of words” agglomeration of reviews


for a product, or composed from product tags provided by customers. In our application this bag-

of-words is a collection of user-generated tags that describe the features/aspects of the product.

A given product description can include multiple instances of the same tag, as the frequency with

which a given tag occurs within a product description depends upon how many people think the

tag is a relevant descriptor for the product. It is important to note that our use of bag-of-words

description is not restrictive, in that there is no inherent sequential ordering to the tag applications,

unlike in natural text (e.g., reviews) where the semantic meaning of the text depends critically on

the sequence of words in a sentence and in a paragraph.

The textual description associated with product d can be represented by a vector wd =

{wd1, wd2, . . . , wdNd}, where wdn represents the n-th token, or tag application within wd. The

set of unique tags across all product descriptions is indexed by v ∈ {1, . . . , V }, where V is the

vocabulary size, i.e., the total number of unique tags within the database. In addition to the prod-

uct content, the database also contains product ratings {yid} provided by the users as part of the

normal operations of the recommender system. The matrix of product ratings is sparse, as only a

fraction of all products are rated by a given user and a fraction of all users provide ratings to a

given product.

We assume that the information in the textual description for each product can be summa-

rized using a set of k = 1, . . . ,K topics. A topic can be considered as a discrete probability

distribution over the vocabulary. Topic k, therefore, is characterized by the probability vector

τ k = {τk1, . . . , τkV }, where the element τkv indicates the probability with which the tag v occurs

under topic k. The probability vector τ k is assumed to come from a symmetric Dirichlet distribution

Dir(τ k|η), where η > 0 is a scalar concentration parameter. The symmetric Dirichlet distribution

is a special case of the Dirichlet distribution in which all the elements of the Dirichlet parameter vector are assumed to be equal. Applying a symmetric Dirichlet is appropriate for topic distributions as we

do not possess prior knowledge that favors one token over another.

The K topics differ in the probabilities τkv with which they generate a given tag v. All the

products in a given recommender system share the same set of topics, but each product description

exhibits a mix of topics that is unique to itself, thus yielding a mixed-membership model (Erosheva,

Blei and Lafferty 2009). The probability with which the K topics are represented within a product

description d are given by the vector of topic proportions ωd = {ωd1, . . . , ωdK}. This vector varies


across the products according to another Dirichlet distribution Dir(ωd|αd), where the Dirichlet

parameter αd is product specific and follows a log-normal distribution, i.e., αd ∼ logN (xdν, δ2).

In the MovieLens data, the standard product attributes include the pre-specified 19 movie genres,

each of which is represented as a binary variable. A movie can exhibit multiple genres, so each movie is described by a vector xd of binary variables. In this way, we allow the standard product attributes such as

movie genres to influence the topic mix in a user-generated product description, and the estimates

of ν and δ2 will indicate the average and variability of such influence.

We also assume that the rating yid that user i gives to product d reflects the user’s preferences

for the product. These ratings can be explained in terms of a parameter vector γi for user i. The

coefficients in γi reflect the extent to which the different latent semantic dimensions (i.e., topics)

matter in explaining the ratings. We assume that the users differ in the preference vectors and

model the heterogeneity in preferences via a normal population distribution, γi ∼ N (β,Γ), where β

represents the population mean and Γ represents the population covariance matrix. The diagonal

elements of the covariance matrix capture the variability in the preference parameters and its

off-diagonal elements reflect how the preferences for different latent topics covary across users.

We fuse together the cross-nested tag applications and ratings to jointly uncover the latent

topics that best predict the product ratings and the preference parameters that underlie these

ratings. As the data contains multiple ratings from each user, we are able to properly account

for sources of unobserved user heterogeneity, which is critical for capturing user preferences in

terms of the latent topics. Given the above description, our model can be specified using the

following generative process. Fixing the number of topics K, the Dirichlet parameter η, the Dirichlet

regression parameters ν and δ2, the population parameters β and Γ, and the error variance σ2, the

CGSTM model generates product descriptions and their associated ratings as follows:

1. Draw topic distribution for each topic, τ k ∼ Dir(η).

2. Draw the Dirichlet parameter αd for each product, αd ∼ logN (xdν, δ2).

3. Draw topic proportions for each product, ωd ∼ Dir(αd).

4. For each tag application n belonging to product description d,

a) Draw topic zdn ∼ Multinomial(ωd).



b) Draw tag wdn ∼ Multinomial(τ zdn).

5. For each user i who rated product d,

a) Draw his/her preference parameters γi ∼ N (β,Γ).

b) Draw the rating $y_{id} \sim \mathcal{N}(\bar{\mathbf{z}}_d' \boldsymbol{\gamma}_i, \sigma^2)$, where $\bar{\mathbf{z}}_d = (1/N_d) \sum_{n=1}^{N_d} \mathbf{z}_{dn}$.

Figure 5: Directed Acyclic Graph for the Cross-Nested Topic Model

Figure 5 shows the directed acyclic graph for the CGSTM model. The model structure differs

from the hybrid models in the recommender system literature in that the content variables are

specified through the unobserved empirical frequencies of the latent topics within each product

description. By relating the ratings to the mean unobserved frequencies zd, rather than on the

topic proportions ωd, we ensure that the ratings are determined by the topic frequencies that

actually underlie the bag of tags description of the product. Such an approach is likely to yield

latent topics that not only capture the semantic content of the tags, but are also most predictive of

the user ratings and reflective of the standard product attributes. Therefore, this approach has the

potential to improve the predictive ability of the model – the central objective of most recommender

systems.
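To make the generative process concrete, the following is a minimal simulation sketch of the CGSTM data-generating steps listed above. All dimensions and parameter values are arbitrary placeholders of our own choosing, and the dense D × I ratings matrix produced here ignores the sparsity of real recommendation data.

```python
import numpy as np

rng = np.random.default_rng(0)
D, I, K, V, P = 50, 200, 5, 100, 3           # movies, users, topics, vocab, covariates
eta, delta2, sigma2 = 0.1, 0.25, 1.0         # placeholder hyperparameter values
nu = rng.normal(size=P)                      # Dirichlet regression coefficients
beta = rng.normal(size=K)                    # population mean preferences
Gamma = 0.5 * np.eye(K)                      # population covariance
x = rng.binomial(1, 0.3, size=(D, P))        # binary product attributes (genres)

tau = rng.dirichlet(np.full(V, eta), size=K)          # 1. topic distributions
alpha = np.exp(rng.normal(x @ nu, np.sqrt(delta2)))   # 2. product-level Dirichlet parameter
docs, zbar = [], np.zeros((D, K))
for d in range(D):
    omega = rng.dirichlet(np.full(K, alpha[d]))       # 3. topic proportions
    N_d = rng.poisson(30) + 1                         # number of tag applications
    z = rng.choice(K, size=N_d, p=omega)              # 4a. topic indicators
    docs.append([rng.choice(V, p=tau[k]) for k in z]) # 4b. tag applications
    zbar[d] = np.bincount(z, minlength=K) / N_d       # empirical topic frequencies
gamma = rng.multivariate_normal(beta, Gamma, size=I)  # 5a. user preferences
y = zbar @ gamma.T + rng.normal(0.0, np.sqrt(sigma2), size=(D, I))  # 5b. ratings
```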


5 Posterior Inference via Stochastic Variational Bayes

The full posterior distribution of our attribute-guided cross-nested supervised topic model is written

as

$$
\begin{aligned}
p(\boldsymbol{\omega}_{1:D}, \mathbf{z}_{1:D}, \boldsymbol{\tau}_{1:K}, \boldsymbol{\gamma}_{1:I}, \boldsymbol{\beta}, \boldsymbol{\Gamma}, \sigma^2, \alpha_{1:D}, \boldsymbol{\nu}, \delta^2 \mid \{w_{dn}\}, \{y_{id}\}, \mathbf{x}_{1:D}) \propto{}& \; p(\boldsymbol{\beta})\, p(\boldsymbol{\Gamma})\, p(\sigma^2)\, p(\boldsymbol{\nu})\, p(\delta^2) \prod_{k=1}^{K} p(\boldsymbol{\tau}_k \mid \eta) \\
& \times \prod_{d=1}^{D} p(\alpha_d \mid \mathbf{x}_d, \boldsymbol{\nu}, \delta^2)\, p(\boldsymbol{\omega}_d \mid \alpha_d) \prod_{n=1}^{N_d} p(z_{dn} \mid \boldsymbol{\omega}_d)\, p(w_{dn} \mid z_{dn}, \boldsymbol{\tau}_{1:K}) \\
& \times \prod_{i=1}^{I} p(\boldsymbol{\gamma}_i \mid \boldsymbol{\beta}, \boldsymbol{\Gamma}) \prod_{d \in D_i} p(y_{id} \mid \boldsymbol{\gamma}_i, \bar{\mathbf{z}}_d, \sigma^2), \qquad (1)
\end{aligned}
$$

where Di denotes the movies rated by user i. As the normalizing constant cannot be computed in

closed form, the posterior distribution is not available analytically, and we therefore need approxi-

mate methods of inference.

Marketers have traditionally used MCMC methods for summarizing the posterior distribution.

MCMC methods involve iteratively sampling parameter values from the posterior distribution and

inference is then based on the sample of correlated draws. MCMC methods such as Gibbs sampling

and the Metropolis-Hastings algorithm typically require many tens of thousands of draws from the

posterior. In big data contexts such as ours, with a large volume of data characterized by millions of movie ratings and high dimensionality resulting from the massive number

of unique tags, MCMC methods are computationally intensive and take a large amount of time to

converge. We therefore use stochastic variational Bayesian methods to approximate the posterior

distribution. As ours is the first application of stochastic variational Bayesian methods in marketing,

we briefly review the methods before deriving a specific instantiation for our model.

5.1 Variational Bayesian Inference

Suppose p(θ|y) represents a posterior for a generic Bayesian model, p(y,θ) is the joint distribution

and p(y) is the normalizing constant. Variational Bayes (VB) methods approximate the intractable

posterior p(θ|y) with a simpler approximating distribution q(θ|λ), called the variational distribu-

tion (Jordan et al. 1999; Bishop 2006; Ormerod and Wand 2010) that belongs to a distribution

family which is indexed by a set of variational parameters λ. In VB inference, we search over the

space of variational distributions to find a member that is closest to the posterior distribution based


on the Kullback-Leibler (KL) divergence (Kullback and Leibler 1951).

5.1.1 Kullback-Leibler Divergence and ELBO

The KL divergence between the approximating distribution q(θ) and the posterior p(θ|y) is defined

as

$$\mathrm{KL}\left[q(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathbf{y})\right] = \int q(\boldsymbol{\theta}) \log \frac{q(\boldsymbol{\theta})}{p(\boldsymbol{\theta} \mid \mathbf{y})} \, d\boldsymbol{\theta} \geq 0, \qquad (2)$$

where the equality holds if and only if $q(\boldsymbol{\theta}) = p(\boldsymbol{\theta} \mid \mathbf{y})$. We can rewrite the KL divergence as

$$\mathrm{KL}\left[q(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathbf{y})\right] = \mathbb{E}_q[\log q(\boldsymbol{\theta})] - \mathbb{E}_q[\log p(\boldsymbol{\theta} \mid \mathbf{y})] = \mathbb{E}_q[\log q(\boldsymbol{\theta})] - \mathbb{E}_q[\log p(\mathbf{y}, \boldsymbol{\theta})] + \log p(\mathbf{y}), \qquad (3)$$

where Eq[·] refers to taking expectations with respect to the variational distribution. As the last

term log p(y) involves the unknown normalizing constant, the KL is not directly computable. How-

ever, we can optimize an alternative objective function called the evidence lower bound (ELBO),

which equals the negative of the KL divergence up to a constant. The ELBO is written as

$$\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(\mathbf{y}, \boldsymbol{\theta})] - \mathbb{E}_q[\log q(\boldsymbol{\theta})]. \qquad (4)$$

Minimizing the KL divergence is therefore equivalent to maximizing the ELBO (against λ).
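As a toy illustration of (4), the sketch below computes a Monte Carlo estimate of the ELBO for a one-parameter model (a normal mean with known variance and a standard normal prior), for which the exact posterior is available; the model and all numbers are ours, chosen only to keep the algebra transparent.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, size=20)            # data from N(theta = 2, 1)

def elbo(m, s, n_mc=5000):
    theta = rng.normal(m, s, size=n_mc)      # draws from q(theta) = N(m, s^2)
    log_joint = norm.logpdf(theta, 0, 1)     # log prior p(theta)
    log_joint += norm.logpdf(y[:, None], theta, 1).sum(axis=0)  # log likelihood
    log_q = norm.logpdf(theta, m, s)
    return np.mean(log_joint - log_q)        # Monte Carlo estimate of (4)

# The ELBO peaks at the exact posterior N(sum(y)/(n+1), 1/(n+1)) ...
print(elbo(y.sum() / (len(y) + 1), (1.0 / (len(y) + 1)) ** 0.5))
print(elbo(0.0, 1.0))                        # ... and is lower away from it
```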

Because we do not know the posterior p(θ|y) to begin with, we must place restrictions on the

approximating variational distribution for the optimization to proceed. These restrictions structure

the approximating distribution such that its functional form is either inferred automatically from

the model structure, or is explicitly set by the analyst via the choice of a specific parametric family

of distributions. The choice of the restrictions represents the trade-off between the tractability and the richness of the approximation. Mean-field restrictions are popular, especially in handling conjugate models, whereas fixed-form approximations are needed for non-conjugate setups. In this

paper we develop an adaptive fixed-form method for efficient inference in non-conjugate models

and show how it can be used in tandem with the structured mean-field approximation to yield a

novel hybrid framework for estimating hierarchical models such as ours. We give enough details in

the main body of the paper for the presentation to be self-sufficient and relegate technical aspects

to the Appendix and Ansari, Li and Zhang (2016).


5.1.2 Hybrid Variational Bayes

In our model, we assume a structured mean-field approximation that mimics the dependency struc-

ture within the joint distribution, and specify the variational distribution as

$$
\begin{aligned}
q(\boldsymbol{\omega}_{1:D}, \mathbf{z}_{1:D}, \boldsymbol{\tau}_{1:K}, \boldsymbol{\gamma}_{1:I}, \boldsymbol{\beta}, \boldsymbol{\Gamma}, \sigma^2, \alpha_{1:D}, \boldsymbol{\nu}, \delta^2) ={}& \; q(\boldsymbol{\beta})\, q(\boldsymbol{\Gamma})\, q(\sigma^2)\, q(\boldsymbol{\nu})\, q(\delta^2) \\
& \times \prod_{d=1}^{D} q(\alpha_d)\, q(\boldsymbol{\omega}_d) \prod_{n=1}^{N_d} q(z_{dn}) \prod_{k=1}^{K} q(\boldsymbol{\tau}_k) \prod_{i=1}^{I} q(\boldsymbol{\gamma}_i). \qquad (5)
\end{aligned}
$$

We use semi-conjugate priors for the population parameters. Specifically, we assume multivariate

normal priors for β and for ν, an inverse-Wishart prior for Γ, and inverse-gamma priors for σ2

and δ2. Except for the Dirichlet parameter αd, the rest of the data generative process entails a

semi-conjugate setup, thus we can derive closed-form expressions for the variational components: $q(\boldsymbol{\omega}_d) = \mathrm{Dir}(\boldsymbol{\zeta}_d)$, $q(z_{dn}) = \mathrm{Multinomial}(\boldsymbol{\phi}_{dn})$, $q(\boldsymbol{\tau}_k) = \mathrm{Dir}(\boldsymbol{\iota}_k)$, $q(\boldsymbol{\gamma}_i) = \mathcal{N}(\boldsymbol{\mu}_{q(\gamma_i)}, \boldsymbol{\Sigma}_{q(\gamma_i)})$, $q(\boldsymbol{\beta}) = \mathcal{N}(\boldsymbol{\mu}_{q(\beta)}, \boldsymbol{\Sigma}_{q(\beta)})$, $q(\boldsymbol{\Gamma}) = \mathcal{IW}(\rho_{q(\Gamma)}, \mathbf{R}_{q(\Gamma)})$, $q(\sigma^2) = \mathcal{IG}(a_{q(\sigma^2)}, b_{q(\sigma^2)})$, $q(\boldsymbol{\nu}) = \mathcal{N}(\boldsymbol{\mu}_{q(\nu)}, \boldsymbol{\Sigma}_{q(\nu)})$, and $q(\delta^2) = \mathcal{IG}(a_{q(\delta^2)}, b_{q(\delta^2)})$.

For the non-conjugate components α1:D, we use a novel stochastic approximation approach,

which we call Adaptive Doubly Stochastic Variational Bayes (ADSVB). ADSVB is an adaptive

extension of the doubly stochastic variational Bayesian (DSVB) method proposed recently by Tit-

sias and Lazaro-Gredilla (2014) to approximate the posterior of a non-conjugate model, which

involves a reparametrization (Rezende et al. 2014; Kingma and Welling 2014) of the non-conjugate component such that a simple stochastic gradient algorithm can be used to update the variational parameters.

In particular, we use Monte Carlo draws from the standard normal to calculate the expectations

involved in computing the gradient, thus yielding stochastic gradients. The stochastic gradient

ascent procedure converges to a local maximum with proper (decreasing) learning rate (Robbins

and Monro 1951). The convergence speed and the stability of stochastic gradient ascent depend on

the learning rate schedule of the stochastic gradient algorithm. We therefore use adaptive step sizes

following the ADAM method (Kingma and Ba 2015) for stable optimization. The integration of

ADAM with DSVB results in a novel adaptive stochastic optimization scheme — ADSVB, which is

numerically stable yet computationally efficient, and is well suited for big data settings (see Ansari

et al. 2016, for additional details regarding hybrid variational Bayes and ADSVB).
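The reparametrization idea at the heart of ADSVB can be sketched on a single non-conjugate component. The toy below fits q(log α) = N(m, s²) to the posterior of a symmetric Dirichlet concentration α given one observed proportion vector, using reparametrized stochastic gradients with ADAM-style adaptive steps. It is an illustration under our own simplifications (a one-document problem, and a finite difference standing in for the analytical gradient), not the paper's full algorithm.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(2)
omega = np.array([0.7, 0.2, 0.1])            # observed topic proportions
K, mu0, d2 = len(omega), 0.0, 1.0            # log-normal prior on alpha

def log_joint(log_a):                        # log p(omega | alpha) + log p(log alpha)
    a = np.exp(log_a)
    log_dir = gammaln(K * a) - K * gammaln(a) + (a - 1.0) * np.log(omega).sum()
    return log_dir - 0.5 * (log_a - mu0) ** 2 / d2

m, log_s = 0.0, np.log(0.5)                  # variational parameters (s kept positive)
mm, vv = np.zeros(2), np.zeros(2)            # ADAM moment accumulators
b1, b2, lr, jitter = 0.9, 0.999, 0.05, 1e-8
for t in range(1, 2001):
    eps = rng.normal()                       # reparametrization: log alpha = m + s * eps
    la = m + np.exp(log_s) * eps
    h = 1e-4                                 # finite-difference gradient of the log joint
    g = (log_joint(la + h) - log_joint(la - h)) / (2 * h)
    grad = np.array([g, g * eps * np.exp(log_s) + 1.0])   # +1 from the entropy of q
    mm = b1 * mm + (1 - b1) * grad
    vv = b2 * vv + (1 - b2) * grad ** 2
    mhat, vhat = mm / (1 - b1 ** t), vv / (1 - b2 ** t)
    m, log_s = np.array([m, log_s]) + lr * mhat / (np.sqrt(vhat) + jitter)
print(m, np.exp(log_s))                      # variational posterior of log alpha
```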

Combining the structured mean-field approximation with the ADSVB scheme, we can now iter-


atively update the variational parameters via coordinate ascent. The resulting estimation process involves using ADSVB to update α1:D, which serves as an inner loop within an outer loop that updates all other (conjugate) parameters using coordinate ascent. To speed up inference in the

big data setting, next we develop a stochastic variational Bayesian procedure.

5.2 Speeding Up Variational Bayes

When fitting a complex model with many individual level latent parameters to a big data set,

the coordinate-ascent procedure requires significant computation because of the need to iteratively

update each latent variable. In particular, we need to update the latent variables associated with

each user in each iteration. This becomes a computational bottleneck if the data contains a large

number of individuals. Recent research has explored strategies to speed up this computation via

stochastic variational inference (Hoffman et al. 2013).

5.2.1 Stochastic Optimization with Adaptive Mini-batch

Recall that in variational Bayesian estimation, our goal is to maximize the ELBO defined in (4).

In the proposed CGSTM model, the ELBO for movie d has the form:

$$
\begin{aligned}
\mathrm{ELBO}_d ={}& \; \mathbb{E}[\log p(\boldsymbol{\omega}_d \mid \alpha_d)] + \sum_{n=1}^{N_d} \mathbb{E}[\log p(z_{dn} \mid \boldsymbol{\omega}_d)] + \sum_{n=1}^{N_d} \mathbb{E}[\log p(w_{dn} \mid z_{dn}, \boldsymbol{\tau}_{1:K})] \\
& + \sum_{i=1}^{I_d} \mathbb{E}[\log p(y_{id} \mid \bar{\mathbf{z}}_d, \boldsymbol{\gamma}_i, \sigma^2)] + \sum_{i=1}^{I_d} \mathbb{E}[\log p(\boldsymbol{\gamma}_i \mid \boldsymbol{\beta}, \boldsymbol{\Gamma})] + \sum_{k=1}^{K} \mathbb{E}[\log p(\boldsymbol{\tau}_k \mid \eta)] + H(q), \qquad (6)
\end{aligned}
$$

where the expectation is taken with respect to the corresponding variational distributions, Id denotes the number of individuals who rated movie d, and the last term H(q) is the entropy over

the entire set of variational distributions in the model. In each iteration of optimizing the ELBO, we take the gradient of the objective function and move the variational parameters in the direction of steepest ascent. In the computation, in every iteration we need to go through each individual

to update their specific coefficients and then form the gradient of the ELBO. When there are a

massive number of individuals in the data, a typical big data situation in marketing, this process

substantially slows down the computation. Given that the ELBO and its gradient both involve

a sum of individual terms that are independent conditional on the population level parameters,

the theory on stochastic optimization (Robbins and Monro 1951; Spall 2003) allows us to save on


computation by only going through a random subsample of individuals in each iteration to obtain

a noisy gradient to update the objective. Because such a cheaply computed gradient is unbiased

across iterations, the objective function will probabilistically converge to an optimum under proper

regulatory conditions (Hoffman et al. 2013).

Specifically, in iteration t, we randomly sample a subset of individuals (without replacement) —

a mini-batch of the data, S(t), and update only the γi’s corresponding to these individuals. Based

on the updated individual parameters we then construct a stochastic gradient to update the other

parameters at the population level (invariant across individuals) in (6). Because the optimization

process is recursive, i.e., updating one parameter depends on the current values of all other parame-

ters, the updating of the individual parameters in the mini-batch is also a function of the population

parameters. At the early iterations, the population parameters are still far from optimality, so it

is wasteful to take a large mini-batch because updating the individual parameters in the mini-batch

has low precision anyway, due to the unconverged population parameters. In fact, we can further

save on computation by sampling a smaller mini-batch (i.e., fewer individuals) at the beginning of the optimization process. From there we quickly and stochastically move the estimation towards

the proper direction, and increase the size of the mini-batch (i.e., sampling more individuals) in

later iterations only if more precision is needed for converging on an optimum. Essentially, we have

an adaptive procedure that automatically determines the most appropriate batch size to use in a

given iteration, leading to significant improvement in the speed and scalability of the already fast

variational Bayesian estimation. In this strategy, it is important to have a rule on when and how

to increase the mini-batch size during the process. In the current paper we adopt the “ratio of path and

progress” criterion (RPP; Gaivoronski 1988; Tan 2015) and let the estimation procedure determine

by itself when to sample more individuals into the mini-batch and by how many. Appendix A

provides the necessary technical details for implementing this adaptation strategy for stochastic

VB. Appendix B outlines the full procedure of stochastic variational Bayesian estimation for the

main model.
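A runnable toy of the adaptive mini-batch idea is sketched below, with a deliberately simple objective: a population mean estimated from per-user statistics by stochastic gradient ascent. The batch-growth rule (grow the mini-batch when net progress becomes small relative to the distance travelled) is our crude stand-in for the RPP criterion; the paper's actual updates and adaptation rule are in Appendices A and B.

```python
import numpy as np

rng = np.random.default_rng(3)
user_stats = rng.normal(3.5, 1.0, size=100_000)   # one summary statistic per user

mu, lr, batch = 0.0, 0.5, 32                      # start with a small mini-batch
path, start = 0.0, mu
for t in range(1, 301):
    idx = rng.choice(len(user_stats), size=batch, replace=False)
    grad = np.mean(user_stats[idx] - mu)          # noisy gradient from the mini-batch
    step = lr / np.sqrt(t) * grad
    mu += step
    path += abs(step)
    if t % 25 == 0:                               # ratio of net progress to path travelled
        if abs(mu - start) < 0.5 * path:          # noise dominates: need more precision
            batch = min(2 * batch, len(user_stats))
        path, start = 0.0, mu
print(mu, batch)                                  # mu approaches 3.5 as the batch grows
```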


6 Results

6.1 Null Models and Holdout Data

We compare our cross-nested topic model to a hierarchical Bayesian model that uses the pre-

specified 19 genres as covariates. As each movie is represented in terms of multiple genres, we

have a set of 19 genre-specific coefficients and an intercept for each user. We assume that the

individual level coefficients come from a multivariate normal population distribution with a full

covariance structure. A comparison between the null and the proposed models can help us assess

the predictive benefits that arise from the greater expressive power of the crowd-sourced semantic

representations inherent in the latent topics. We estimate both models using stochastic variational

Bayesian methods. With a convergence criterion of 10⁻⁶ on the ELBO, the variational Bayesian

estimation on the null model finishes in 10 iterations and 585.9 seconds (0.16 hours), whereas the MCMC estimation of the same model with 5,000 draws takes 21.1 hours to complete.¹ Also, the (mean) parameter estimates between VB and MCMC are virtually indistinguishable.² Given that SVB significantly outperforms MCMC in computational time, we do not use MCMC when estimating the main model, which is far more complex than the null, because we expect that MCMC methods will not be competitive in terms of the computational time needed for estimation. Our

model contains a very large number of latent variables, including the multinomial topic indicators

for each of the 233,268 tag applications, and multivariate user-level coefficients for each of the

111,793 users. The use of data augmentation to sample these latent variables is likely to be time

consuming, especially given that MCMC typically takes a much larger number of iterations to

converge, when compared to VB.

To evaluate the predictive performance of the genre-only model and the CGSTM model, we

split our dataset into calibration and holdout samples. We estimate both models on the calibration

data and make predictions on both datasets. To form the holdout data, we set aside 8 movies per

individual. This resulted in a total of 7,970,717 ratings in the calibration data and 894,344 ratings

in the holdout data. The movies and users in either dataset are the same as in the full sample.

¹For a fair comparison we code both VB and MCMC in Mathematica 10 and use the just-in-time compilation capability of Mathematica to compile the programs to C. We run both programs on a computer with a 3GHz 8-Core Intel Xeon E5 processor and 32G of RAM.

²The underestimation of parameter uncertainty is still present under mean-field VB estimation, although the use of the structured mean-field approximation mitigates this effect. However, for the purposes of model prediction and product recommendation, it is the mean estimates of the model parameters that matter.


We estimate multiple versions of the proposed model that differ on the number of topics K.

Based on model fit and predictive performance reported below, we settle on a model with 20

topics and the results presented here are all obtained from this version. We estimate our model

using both deterministic and stochastic variational Bayesian methods. The deterministic VB takes

18,930 seconds (5.3 hours) and 275 iterations to converge, whereas the stochastic VB with adaptive

mini-batch sizes takes only 4,667 seconds (1.3 hours) and 150 iterations to finish. Convergence is

declared when the joint Euclidean norm on the population level parameters changes by less than

10⁻⁴ between iterations. The substantial difference in computational times once again highlights

the scalability benefits of stochastic variational inference in big data settings. Details of our SVB

approach are outlined in Appendix B. As the actual estimates do not vary across the two estimation

schemes, we now report the results from the SVB approach. We begin with the qualitative insights

that can be gleaned from our model and discuss predictive performance and model fit subsequently.

6.2 Recovery of Topics

Topic Distributions

The model yields topic distributions that are most predictive of the ratings.

Table 1 shows the top 8 tags within each of the 20 topics. These topics are arranged in ascending

order of the population mean coefficients β on the topics. Thus the most important topics are

shown at the bottom of Table 1. It is interesting to see how certain tags naturally congregate

to form meaningful topics. For example, Topic 18 is about American animation movies, often produced by Pixar and Disney for children, which are usually fairy-tale and romance-themed. Topic 12 is about Japanese animation movies, which are predominantly made by Hayao Miyazaki of Studio

Ghibli, whose style is quite different from those of the “family-fun” Disney or Pixar productions.

His films tend to have story lines and characters that are fantastic, surreal and dream-like in nature

(e.g. Spirited Away, Howl’s Moving Castle, to name a few), perhaps even eliciting feelings of drug-

induced “hallucination” for certain users. Topic 10 appears to be closely related to the director

Quentin Tarantino, who uses action and violence as overarching styles in all of his films, and has directed works related to WWII (Inglourious Basterds, starring Brad Pitt), martial arts (Kill

Bill), and revenge (Kill Bill, Pulp Fiction). It is clear that the topics richly depict the semantic

information pertaining to the theme, provenance, popularity, awards, and actors of a movie. Thus

they capture a much greater semantic terrain than what is possible using a set of genre dummies.


Table 1: Top Eight Tags Associated with Each Topic

Topic  Top 8 Tags Within Each Topic
1      based on a book, nudity (topless), franchise, remake, natalie portman, action, sequel, stupid
2      predictable, boring, adam sandler, acting, silly, plot, story, bad plot
3      sports, nudity (topless), bd-r, betamax, boxing, cars, police, sylvester stallone
4      zombies, horror, disturbing, serial killer, cult film, gore, creepy, campy
5      johnny depp, tim burton, musical, based on a book, gothic, visually appealing, ghosts, adapted:book
6      atmospheric, criterion, nudity (full frontal), based on a book, disturbing, slow, stanley kubrick, mental illness
7      quirky, cinematography, high school, scarlett johansson, relationships, coming of age, philip hoffman, suicide
8      religion, less than 300 ratings, gay, christianity, lesbian, queer, jesus, british
9      superhero, action, comic book, james bond, marvel, new york city, will smith, franchise
10     quentin tarantino, world war ii, revenge, martial arts, violence, brad pitt, action, western
11     politics, social commentary, documentary, true story, musicians, music, history, corruption
12     anime, surreal, drugs, stylized, studio ghibli, japan, hayao miyazaki, hallucinatory
13     thought-provoking, surreal, time travel, sci-fi, nonlinear, mindfuck, visually appealing, complicated
14     comedy, funny, parody, hilarious, witty, satire, bill murray, high school
15     sci-fi, dystopia, action, space, post-apocalyptic, aliens, artificial intelligence, adventure
16     twist ending, thriller, based on a book, mystery, psychological, psychology, christian bale, morgan freeman
17     classic, afi, black and white, bd-r, clv, noir thriller, oscar (best picture), espionage
18     fantasy, animation, pixar, disney, adventure, fairy tale, romance, children
19     drama, romance, based on a true story, tom hanks, true story, historical, oscar (best picture), inspirational
20     atmospheric, dark comedy, stylized, quirky, black comedy, imdb top 250, violence, cult film

Table 2: Population Coefficients and Variances Associated with the Topics

Topic     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17    18    19    20
β         1.02  2.41  2.81  2.99  3.16  3.25  3.35  3.48  3.53  3.54  3.59  3.63  3.66  3.69  3.99  4.10  4.14  4.18  4.22  4.59
Diag(Γ)   1.97  0.88  0.80  1.11  0.56  1.87  0.90  0.68  0.43  0.50  0.48  1.23  0.43  0.54  0.48  0.56  0.48  0.63  0.58  0.78


Table 2 reports the population mean coefficient and the population variance associated with each

topic. The table shows that movies associated with the last few topics tend to have high ratings,

on average. We also see that the heterogeneity associated with the individual level coefficients γi

varies across the topics. The large magnitude of these cross-user variances indicates that users

exhibit considerable heterogeneity in their sensitivities to different topics.

This result exhibits face validity as higher numbered topics tend to contain characteristics that

most users would consider to be desirable. For example, Topic 19 contains “Oscar”, “true story”

and “inspirational”, and Topic 20 includes “IMDB top 250” and semantics related to dark humor,

which often sets a high requirement for screenplay excellence. All of these qualities are considered

desirable by most people, thus resulting in a higher mean and a smaller standard deviation. On the

other hand, lower numbered topics tend to contain more negative and polarizing semantics. For

instance, for Topics 1 and 2, although there exist fans of silly, sequel, or Adam Sandler movies,

many others might consider them as undesirable, as evidenced by tags such as “predictable” and

“boring”.

Top Movies Associated with a Topic

Apart from the above topic descriptions, we can also

identify the top movies associated with each topic using the topic proportions ωd for movies.

Specifically, for a given topic k, we find the movies that have the largest values of the proportion ωdk. Due to space constraints, Table 3 shows only the top three movies associated with

each of the 20 topics.
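Operationally, producing such a list amounts to sorting the movies on their estimated proportions for a topic. A small sketch, where `omega` (a D × K matrix of fitted topic proportions) and `titles` are assumed outputs of the estimated model:

```python
import numpy as np

def top_movies(omega, titles, k, n=3):
    """Return the n movies with the largest proportion omega[d, k] for topic k."""
    order = np.argsort(omega[:, k])[::-1]
    return [titles[d] for d in order[:n]]
```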

It can be seen by juxtaposing Tables 1 and 3 that the identified top movies are consistent with

the semantic content of the topic. For example, the top movies associated with Topic 9 include

different sequels of “Iron Man” and “Spider-Man”, which is highly consistent with the top 8 tags

for that topic in Table 1, which describe the topic as films based on action superheroes from the Marvel Comics franchise. Another example of top movies that emerge from the model is under Topic

19, which include dramas based on inspirational historical backgrounds that have won Oscars (i.e.

Forrest Gump, Titanic and The King’s Speech). Movie enthusiasts would also note that, although

the top movies associated with Topic 20 are also critically acclaimed, they are of a different flavor

compared to those of Topic 19. For example, Black Swan, Pulp Fiction and Wristcutters’ successes


Table 3: Top Movies Associated with Each Topic

Topic  Top 3 Movies
1      Simon Says (2009); House of the Dead, The (2003); Date Movie (2006)
2      Skyline (2010); Snow White and the Huntsman (2012); You Don’t Mess with the Zohan (2008)
3      Rocky IV (1985); Rocky III (1982); Observe and Report (2009)
4      Day of the Dead (1985); Hostel (2005); Martyrs (2008)
5      Alice in Wonderland (2010); Mamma Mia! (2008); Dark Shadows (2012)
6      Persona (1966); Mirror, The (Zerkalo) (1975); Antichrist (2009)
7      Vicky Cristina Barcelona (2008); Darjeeling Limited, The (2007); August: Osage County (2013)
8      Passion of the Christ, The (2004); Shelter (2007); C.R.A.Z.Y. (2005)
9      Iron Man 2 (2010); Iron Man 3 (2013); Spider-Man (2002)
10     Ong-Bak: The Thai Warrior (2003); Ip Man (2008); Kill Bill: Vol. 1 (2003)
11     Food, Inc. (2008); This Film Is Not Yet Rated (2006); Hustle & Flow (2005)
12     Holy Mountain, The (1973); FLCL (2000); Fear and Loathing in Las Vegas (1998)
13     Primer (2004); Timecrimes (2007); Mr. Nobody (2009)
14     Superbad (2007); Anchorman: The Legend of Ron Burgundy (2004); Old School (2003)
15     Edge of Tomorrow (2014); Mad Max: Fury Road (2015); Star Wars: Revenge of the Sith (2005)
16     Dark Knight Rises, The (2012); Shutter Island (2010); Prestige, The (2006)
17     Maltese Falcon, The (1941); Rear Window (1954); Casablanca (1942)
18     How to Train Your Dragon (2010); Up (2009); Beauty and the Beast (1991)
19     Forrest Gump (1994); Titanic (1997); King’s Speech, The (2010)
20     Pulp Fiction (1994); Black Swan (2010); Wristcutters: A Love Story (2006)

are based more on quirky scripts than on historical grandeur or high production value.

Topic Proportions for a Given Movie Focusing on a given movie and using the topic proportions ωd for this movie, we can study how the “bag of tags” representation draws from the different topics. Figure 6 shows the topic proportions for “Forrest Gump”. We see that this movie draws heavily from Topic 19, followed at a distance by Topics 20 and 14. A look at the tags associated with these topics in Table 1 reveals their striking relevance in describing “Forrest Gump”, an Oscar-winning romantic and inspirational drama starring Tom Hanks, set in historical America from the 1950s to the present day (Topic 19), and with the quirky protagonist Forrest who engages in behaviors and dialogue that are funny in a satirical way (Topics 14 and 20).

This example shows that our model can 1) flexibly allow a movie to be described by multiple topics, and 2) produce relative rankings of the topics related to a movie that are interesting and informative. For instance, although “Forrest Gump” has some of the comical characteristics found in Topics 14 and 20, most users would remember it most fondly for the inspirational features described by Topic 19. A movie that is most related to one topic is not married entirely to that topic. This flexible feature of our recommender system allows for a multi-topic, rank-ordered descriptor that offers richness beyond a single topic.


Figure 6: Topic Proportions for Forrest Gump

We see from the above that our model is capable of generating deep qualitative insights about the underlying drivers of movie preference. Such information is highly valuable in a recommender system, as it can be used to explain why a particular product is being recommended. We now move on to show how the model can be used for recommending products to users and how well it predicts preferences. We begin by showing how our model can be used to identify movies that are similar to a given movie.

Movies Similar to a Given Movie Item-based collaborative filtering algorithms identify those products that are closest to a given product in their appeal to customers. This is usually done using solely the ratings matrix, and the approach therefore cannot fully explain why a particular recommendation is being made.

Our model can be leveraged to compute distances between movies based on their topic proportion vectors. While many different distance metrics can be used for this task, here we use the Hellinger distance (Nikulin 2001) to compute the similarity between two movies, d and d′, based on their topic proportions, ωd and ωd′, respectively. The Hellinger distance is defined as

H(d, d') = \sqrt{\frac{1}{2}\sum_{k=1}^{K}\left(\sqrt{\omega_{dk}} - \sqrt{\omega_{d'k}}\right)^2},   (7)

and satisfies the triangle inequality. The use of the topic proportion vectors implies that both ratings and textual content are utilized in computing closeness between movies, since the topic distribution is inferred from both the tag and the rating information. This therefore differs from relying solely on movie ratings, or solely on movie content, to compute similarity.
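Equation (7) is straightforward to vectorize. The following is a minimal sketch, again assuming a hypothetical D × K NumPy array omega of fitted topic proportions, that computes the Hellinger distance from one focal movie to all others and returns the nearest neighbors used to build tables like Table 4:

import numpy as np

def hellinger_neighbors(omega, focal, n=5):
    # Hellinger distance (equation 7) from movie `focal` to every movie.
    sqrt_omega = np.sqrt(omega)                      # D x K
    diff = sqrt_omega - sqrt_omega[focal]            # broadcasts over all D movies
    dist = np.sqrt(0.5 * np.sum(diff ** 2, axis=1))
    order = np.argsort(dist)
    return order[order != focal][:n]                 # n nearest movies, self excluded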


Table 4: Five Most Similar Movies for a Given Movie

Movie: Pulp Fiction
Similar movies: Inglourious Basterds; Kill Bill 1; Kill Bill 2; Lock, Stock & Two Smoking Barrels; Django Unchained

Movie: The Dark Knight (2008)
Similar movies: Leon: The Professional; The Dark Knight Rises (2012); Lucky Number Slevin; The Prestige; The Game

Movie: The Lord of the Rings (2003)
Similar movies: The Lord of the Rings (2001); WALL·E; The Hobbit; Star Wars 5; Inside Out

The ability to identify nearest neighbors in content space, while simultaneously taking into account

the preferences of users, is highly beneficial in the day-to-day operations of a recommender system.

For instance, this gives the system the capability of suggesting additional movies that are similar

to the movie a user queries about.

Using the Hellinger distance, Table 4 shows the five most similar movies for each of three example movies. It is interesting to see that the movies deemed similar to “Pulp Fiction” almost all feature the directorial talent of Quentin Tarantino. Movies most similar to “The Dark Knight (2008)”, in addition to its obvious 2012 sequel, tend to be thrillers with complicated plots and elements of suspense. Finally, movies that would be good (non-personalized) recommendations for those who enjoy “The Lord of the Rings (2003)” tend to be sci-fi and fantasy films with high production values. These movie-similarity results that emerge from the model indeed exhibit high face validity.

6.3 Predictive Performance

We now report how our model performs in predicting the ratings. Table 5 presents the predictive

performance measures for the Null model and for the different versions of the proposed model, on

both the calibration and the holdout datasets.

We calculate the mean absolute deviation (MAD), root mean squared error (RMSE) and the

correlation between the actual and the predicted ratings. In addition, the table also reports the

predictive R2 for these models. The column “Genre-Only” refers to the Null model. The remaining

columns refer to the variants of the proposed cross-nested topic model. These versions differ in the

number of topics, with CGSTM10, for example, indicating a 10-topic model.

Table 5: Measures of Predictive Performance

Calibration Sample
                 Genre-Only  CGSTM10  CGSTM20  CGSTM30  CGSTM40  CGSTM50
MAD                 0.68      0.60     0.56     0.54     0.54     0.53
RMSE                0.87      0.78     0.73     0.71     0.70     0.70
Correlation         0.55      0.66     0.71     0.73     0.74     0.74
Predictive R2       0.29      0.43     0.50     0.53     0.54     0.54

Holdout Sample
                 Genre-Only  CGSTM10  CGSTM20  CGSTM30  CGSTM40  CGSTM50
MAD                 0.74      0.67     0.65     0.65     0.65     0.65
RMSE                0.95      0.87     0.85     0.84     0.84     0.84
Correlation         0.41      0.54     0.58     0.59     0.59     0.59
Predictive R2       0.17      0.29     0.34     0.34     0.34     0.34

From the comparison we can see that all the different versions of the cross-nested SLDA model significantly outperform the Genre-Only model, even though the Null is fairly sophisticated in its treatment of user heterogeneity. It is

interesting to note that even a cross-nested SLDA model with just ten topics does better than the

null model with 19 genres. Of all the variants, the 20-topic version offers the best trade-off between model complexity and predictive performance. Adding more topics does not significantly improve predictions but increases the overall computational time. Therefore, throughout this section we report results from the 20-topic model only.
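For concreteness, the four measures in Table 5 can be computed from vectors of actual and predicted ratings as sketched below; y and y_hat are hypothetical NumPy arrays, and predictive R2 is taken to be one minus the ratio of residual to total sum of squares, a common definition that the text does not spell out:

import numpy as np

def fit_measures(y, y_hat):
    # MAD, RMSE, actual-predicted correlation, and predictive R^2.
    mad = np.mean(np.abs(y - y_hat))
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    corr = np.corrcoef(y, y_hat)[0, 1]
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return mad, rmse, corr, r2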

Table 6: Predicted vs. Actual Ratings
(rows: predicted rating groups; columns: actual rating bands; entries are row proportions)

Calibration Sample, Genre-Only Model
              Low     Medium   High
Low          0.437    0.374    0.189
Medium       0.126    0.368    0.506
High         0.083    0.135    0.782

Calibration Sample, CGSTM20 Model
              Low     Medium   High
Low          0.550    0.363    0.087
Medium       0.094    0.392    0.513
High         0.007    0.072    0.921

Holdout Sample, Genre-Only Model
              Low     Medium   High
Low          0.307    0.385    0.307
Medium       0.109    0.316    0.574
High         0.111    0.155    0.734

Holdout Sample, CGSTM20 Model
              Low     Medium   High
Low          0.395    0.412    0.193
Medium       0.088    0.326    0.586
High         0.021    0.099    0.880

Table 6 shows how we can use the predicted ratings from our model to make movie recom-

mendations. Users of a recommender system are interested in identifying good movies that match their tastes, so suggestions from our model should correspond well with the good movies (i.e., those with high ratings) in the database. To assess this aspect, we divide the observations within our calibration


and holdout samples into three groups (i.e., Low, Medium and High) based on the 1/3 and 2/3

percentiles of the predicted ratings. We then compute the proportions of observations within each

of these three groups that have Low (actual ratings from zero to 2), Medium (actual ratings from

to 3.5), and High (actual ratings from 4 to 5) ratings. For example, for the CGSTM20 model,

the 0.921 entry in the High-High cell for the calibration sample indicates that 92.1% of the observations that the model predicts to have high ratings indeed have true ratings between 4 and 5. It is clear from the table that our proposed model predicts much better than the “Genre-Only” model in each of the three groups, on both the calibration and the holdout samples. We also see that the predictions in the holdout sample are slightly worse than in the calibration sample, as expected.
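A minimal sketch of this construction follows; y and y_hat are hypothetical arrays on the 0-5 half-star scale, and the cut points 2.25 and 3.75 are arbitrary values lying between the Low/Medium and Medium/High bands:

import numpy as np

def tercile_table(y, y_hat):
    # Group observations by terciles of the predicted ratings, then tabulate
    # the proportions of Low/Medium/High actual ratings within each group.
    cuts = np.quantile(y_hat, [1 / 3, 2 / 3])
    pred_grp = np.digitize(y_hat, cuts)          # 0 = Low, 1 = Medium, 2 = High
    actual_grp = np.digitize(y, [2.25, 3.75])    # 0-2 / 2.5-3.5 / 4-5 bands
    table = np.zeros((3, 3))
    for p, a in zip(pred_grp, actual_grp):
        table[p, a] += 1
    return table / table.sum(axis=1, keepdims=True)   # row proportions, as in Table 6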

6.4 Movie Recommendations

We now focus on how our model can be used in different ways for recommending movies to users.

We first show how we can make unconditional recommendations for a given user. We then show

how the model also supports personalized search based on user queries. We also provide a succinct

interpretation of these recommendations.

6.4.1 Unconditional Recommendations without User Query

Using three randomly chosen users from the data as an example, we show in Table 7 the ten most

preferred movies for the three users. These recommendations are unconditional as they are not

predicated on any additional user input or user query. As we can see, the three users have very

different sets of movie recommendations, reflecting their different tastes, and thus highlighting the

importance of preference heterogeneity in our recommender system. One can see that User 1 prefers

visually-stunning, large-production action films. User 2, instead of large Hollywood productions,

prefers movies with quirky story lines and leading characters with unusual logic, and appears to be

a big fan of director Quentin Tarantino’s films. User 3 clearly prefers classic films.

Recall that our model accounts for the topic mix and the textual description of every film. Therefore, besides the unconditional recommendations above, the model can also support personalized search and can generate personalized rankings of movies conditional on individual queries involving movie titles or movie keywords. Personalized search and ranking


Table 7: Top Recommended Movies for the Three Users

User 1: Transformers: Revenge of the Fallen (2009); Armageddon (1998); Pearl Harbor (2001); Con Air (1997); Star Wars: The Phantom Menace (1999); Transformers: Dark of the Moon (2011); Transformers (2007); Star Wars: Attack of the Clones (2002); G.I. Joe: The Rise of Cobra (2009); Rocky IV (1985)

User 2: Pulp Fiction (1994); Grand Budapest Hotel, The (2014); Inglourious Basterds (2009); City of God (2002); Django Unchained (2012); American Beauty (1999); Office Space (1999); Kill Bill: Vol. 1 (2003); Snatch (2000); Lock, Stock & Two Smoking Barrels (1998)

User 3: Casablanca (1942); Rear Window (1954); Citizen Kane (1941); Sunset Blvd. (1950); Third Man, The (1949); Whiplash (2014); City Lights (1931); Double Indemnity (1944); Maltese Falcon, The (1941); North by Northwest (1959)

with additional user input is practically very useful, in that users may express different interests at different times even though their latent movie preferences stay the same. The context of the search or browsing session can thus be taken into account to improve recommendation quality.

6.4.2 Movie-based Personalized Search

When a user actively searches for movies that are similar to a given movie, either by explicitly typing in the name of the movie or by browsing the description of the movie, we can leverage this extra information to obtain personalized rankings of the movies most similar to the searched movie. For instance, given a search for the movie “Pulp Fiction”, we can use the Hellinger distance to find the most similar movies. The top 10 movies according to population preferences are: “Inglourious Basterds (2009)”, “Kill Bill 1 (2003)”, “Kill Bill 2 (2004)”, “Lock, Stock & Two Smoking Barrels (1998)”, “Django Unchained (2012)”, “Drive (2011)”, “Sin City (2005)”, “Once Upon a Time in the West (1968)”, “Snatch (2000)”, and “From Dusk Till Dawn (1996)”. This forms the consideration set of relevant movies. We can then use each movie’s topic proportions and the estimated individual-specific coefficients γi to predict the individual’s preference ranking within this relevant set.
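A sketch of the scoring step, under the simplifying assumption that user i’s predicted appeal for a movie is the inner product of the movie’s topic proportions with the variational mean of γi (omega, gamma_i and candidates are hypothetical fitted objects):

import numpy as np

def personalized_ranking(omega, gamma_i, candidates):
    # Score each candidate movie for one user and sort by predicted appeal.
    scores = omega[np.asarray(candidates)] @ gamma_i
    order = np.argsort(scores)[::-1]              # highest predicted appeal first
    return [candidates[j] for j in order]

Chained with the earlier hellinger_neighbors sketch, personalized_ranking(omega, gamma_i, hellinger_neighbors(omega, focal, n=10)) would produce the kind of per-user ordering shown in Table 8.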

Continuing with the three users in the above example, their personalized rankings and top recommended movies, conditional on their search for “Pulp Fiction”, are shown in Table 8. We can see that, of the ten considered films, “From Dusk Till Dawn”, a vampire-killing film, offers perhaps the most non-stop action, which fits the preference of User 1. For User 2, the top recommendation is “Inglourious Basterds (2009)”, a Tarantino film. User 3’s top recommendation is “Once Upon a Time in the West (1968)”, the oldest and most classic film


Table 8: Top Recommended Movies for the Three Users with Search of a Movie

User 1: From Dusk Till Dawn (1996); Snatch (2000); Kill Bill 1 (2003); Django Unchained (2012); Kill Bill 2 (2004); Sin City (2005); Inglourious Basterds (2009); Lock, Stock & Two Smoking Barrels (1998); Drive (2011); Once Upon a Time in the West (1968)

User 2: Inglourious Basterds (2009); Django Unchained (2012); Kill Bill 1 (2003); Snatch (2000); Lock, Stock & Two Smoking Barrels (1998); Kill Bill 2 (2004); Sin City (2005); Once Upon a Time in the West (1968); Drive (2011); From Dusk Till Dawn (1996)

User 3: Once Upon a Time in the West (1968); Drive (2011); Inglourious Basterds (2009); Snatch (2000); Django Unchained (2012); Lock, Stock & Two Smoking Barrels (1998); Kill Bill 1 (2003); Sin City (2005); Kill Bill 2 (2004); From Dusk Till Dawn (1996)

among the set.

We now examine what happens when a user enters a set of movie-related keywords to seek movie recommendations.

6.4.3 Keywords-Based Personalized Search

When a user enters a list of keywords, these keywords can be grouped to form a new “document” that describes the type of movie the user is interested in, and recommendations can then be made conditional on this new document. Specifically, when a user searches for one or more keywords, we locate these keywords in the vocabulary and obtain each keyword’s probability in each of the K topic distributions. We sum the probabilities over the keywords to get the topic mix for the new document. Then, we use the Hellinger distance to determine the set of most similar movies, and apply the individual-specific coefficients to predict preferences and personalize the ranking of the considered movies.
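A minimal sketch of the query step, assuming a hypothetical K × V topic-word matrix tau and a dict vocab_index mapping tags to their columns; the final normalization is our addition so that the result is a proper distribution for the Hellinger computation:

import numpy as np

def query_topic_mix(tau, vocab_index, keywords):
    # Sum each searched keyword's probability under the K topic distributions,
    # then normalize into a topic-proportion vector for the new "document".
    cols = [vocab_index[w] for w in keywords if w in vocab_index]
    mix = tau[:, cols].sum(axis=1)
    return mix / mix.sum()

The resulting vector plays the role of ωd for the new document, so the same Hellinger computation as in equation (7) identifies the consideration set.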

For instance, if a user searches for the two keywords “heartwarming” and “inspirational”, the

model can generate a set of ten movies that are closest to the new document consisting of these

two tags. Based upon population preferences, these movies are “Braveheart (1995)”, “Forrest Gump

(1994)”, “Amistad (1997)”, “Ben-Hur (1959)”, “Dances with Wolves (1990)”, “Secondhand Lions

(2003)”, “Cider House Rules, The (1999)”, “Into the Wild (2007)”, “Fiddler on the Roof (1971)”,

and “Saving Private Ryan (1998)”. Again, although the three users’ searches are identical, their preferences differ, so the final rankings and top movie recommendations differ across them, as shown in Table 9. As we can see, “Braveheart (1995)” is a large-production action movie among this set of “heartwarming, inspirational” films, which would please User 1, the “action


enthusiast”. User 2’s top recommendation is “Into the Wild (2007)”, which features a young college grad who shuns the material world, cuts off communication with his family, lives off the land in Alaska, and eventually dies of food poisoning. Consistent with User 2’s preferences, this film’s production is small, and the storyline and the protagonist’s actions are unusual. “Saving Private Ryan (1998)” is the top recommendation for User 3. Its story is based on a historical event in WWII, which might please our “classics enthusiast”.

Table 9: Top Recommended Movies for the Three Users with Search of Keywords

User 1: Braveheart (1995); Forrest Gump (1994); Saving Private Ryan (1998); Dances with Wolves (1990); Secondhand Lions (2003); Amistad (1997); Cider House Rules, The (1999); Into the Wild (2007); Fiddler on the Roof (1971); Ben-Hur (1959)

User 2: Into the Wild (2007); Saving Private Ryan (1998); Forrest Gump (1994); Braveheart (1995); Cider House Rules, The (1999); Amistad (1997); Dances with Wolves (1990); Ben-Hur (1959); Secondhand Lions (2003); Fiddler on the Roof (1971)

User 3: Saving Private Ryan (1998); Into the Wild (2007); Ben-Hur (1959); Braveheart (1995); Fiddler on the Roof (1971); Forrest Gump (1994); Cider House Rules, The (1999); Dances with Wolves (1990); Amistad (1997); Secondhand Lions (2003)

7 Conclusion

In this research, we contribute to the literature on recommendation systems by developing a novel

cross-nested supervised latent Dirichlet allocation topic model that leverages both numerical rating

data and textual data to make movie recommendations. The latent topics are inferred from both the tag vocabulary and the user ratings, which increases the predictive ability of the topics, and the use of crowd-sourced tags alleviates the often onerous requirement for firm-provided product attributes. We apply this topic model to a dataset of 8.87 million movie ratings, and develop a new stochastic variational Bayesian approach to estimate the model. We show that the resulting recommendation model generates much better predictions than the benchmark model, yields a number of interesting insights regarding user preferences for movies, and requires little estimation time, thereby making it useful in actual recommendation contexts.

We also show how our modeling framework can be useful for generating targeted recommenda-

tions for users. In particular, we show how our model can support personalized search for movies

that are similar to a given movie. We show how the topic proportions for a given movie can be


used to generate a set of relevant movies that are similar to the focal movie. User level preference

parameters can then be used to generate a unique ordering of the relevant movies for each user,

thus resulting in personalized search rankings.

On the methodology front, we highlight the usefulness of the variational Bayesian framework for

marketing. The rapid growth in data volume, data variety, and crowd-sourcing that characterizes

big data has opened new opportunities for deeper learning about customer preferences and for

using predictions obtained from such data for marketing actions. Novel and complex marketing

models such as ours that include many latent variables and parameters are needed to fully capitalize

on the information contained in big data settings. Scalable inference therefore becomes one

of the major challenges stemming from big data and big learning environments, and traditional approaches to Bayesian inference, such as MCMC, suffer from computational bottlenecks in practical big data applications. In contrast, by integrating state-of-the-art advances in variational Bayesian inference, we show how stochastic VB can offer a versatile solution to the scalability and computational challenges of estimating complex marketing models for big data, particularly in the context of product recommendation on the internet.

While we focus on stochastic VB in the current paper, other methods to tackle scalability issues

are being actively investigated in the Bayesian literature. VB methods, especially those that are

based on fully factorized mean-field approximations can result in an underestimation of parameter

uncertainty. However, they are useful in marketing situations where predictive performance in a

managerially realistic amount of time is of critical importance. Many such scenarios are documented

in Cui and Curry (2005). Other recently proposed methods for scalable inference include stochastic gradient MCMC and approaches that rely on subsampling or divide-and-conquer strategies to speed up MCMC algorithms. While these are active research areas, their utility in complex hierarchical

settings such as ours is yet to be demonstrated. In the end, all these scalable big data solutions entail

an important trade-off between unlimited accuracy in unlimited time and a good approximation

within a reasonable time, and different decision contexts imply different resolutions of this trade-

off (Bottou and Bousquet 2007). We leave a thorough comparison of these approaches for future

research.


References

Ansari, A., S. Essegaier and R. Kohli (2000). “Internet recommendation systems,” Journal of Marketing Research, 37(3), 363-375.

Ansari, A., Y. Li and J. Z. Zhang (2016). “Stochastic Variational Bayesian Inference for Big Data Marketing Models,” Working Paper.

Bishop, C. (2006). Pattern Recognition and Machine Learning, New York, Springer.

Blei, D. and J. McAuliffe (2007). “Supervised topic models,” Neural Information Processing Systems, 21.

Blei, D., A. Ng and M. Jordan (2003). “Latent Dirichlet allocation,” Journal of Machine Learning Research, 3, 993-1022.

Bodapati, A. V. (2008). “Recommendation systems with purchase data,” Journal of Marketing Research, 45(1), 77-93.

Bottou, L. and O. Bousquet (2007). “The tradeoffs of large-scale learning,” Advances in Neural Information Processing Systems, 20.

Braun, M. and J. McAuliffe (2010). “Variational inference for large-scale models of discrete choice,” Journal of the American Statistical Association, 105(489), 324-335.

Chung, J. and V. R. Rao (2012). “A general consumer preference model for experience products: application to internet recommendation services,” Journal of Marketing Research, 49(3), 289-305.

Chung, T. S., R. T. Rust and M. Wedel (2009). “My mobile music: an adaptive personalization system for digital audio players,” Marketing Science, 28(1), 52-68.

Cui, D. and D. Curry (2005). “Prediction in marketing using the support vector machine,” Marketing Science, 24(4), 595-615.

Desrosiers, C. and G. Karypis (2011). “A comprehensive survey of neighborhood-based recommendation methods,” Recommender Systems Handbook, 107-144.

Dzyabura, D. and J. R. Hauser (2011). “Active machine learning for consideration heuristics,” Marketing Science, 30(5), 801-819.

Gaivoronski, A. (1988). “Implementation of stochastic quasigradient methods,” Numerical Techniques for Stochastic Optimization, 313-352, New York, Springer-Verlag.

Harper, M. and J. A. Konstan (2015). “The MovieLens datasets: history and context,” ACM Transactions on Interactive Intelligent Systems, 5(4).

Hastings, W. K. (1970). “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, 57(1), 97-109.

Hoffman, M., D. M. Blei, C. Wang and J. Paisley (2013). “Stochastic variational inference,” Journal of Machine Learning Research, 14(1), 1303-1347.

Jordan, M. I., Z. Ghahramani, T. S. Jaakkola and L. K. Saul (1999). “An introduction to variational methods for graphical models,” Machine Learning, 37, 183-233.

Koren, Y. and R. Bell (2011). “Advances in collaborative filtering,” Recommender Systems Handbook, 145-186.

Kullback, S. and R. A. Leibler (1951). “On information and sufficiency,” The Annals of Mathematical Statistics, 22, 79-86.

Lee, T. Y. and E. T. Bradlow (2011). “Automated marketing research using online customer reviews,” Journal of Marketing Research, 48(5), 881-894.

Lops, P., M. Gemmis and G. Semeraro (2011). “Content-based recommender systems: state of the art and trends,” Recommender Systems Handbook, 73-105.

Minka, T. P. (2001). “A family of algorithms for approximate Bayesian inference,” Ph.D. Thesis, MIT.

Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective, The MIT Press.

Ormerod, J. T. and M. P. Wand (2010). “Explaining variational approximations,” The American Statistician, 64(2), 140-153.

Rezende, D. J., S. Mohamed and D. Wierstra (2014). “Stochastic backpropagation and approximate inference in deep generative models,” Proceedings of the 31st International Conference on Machine Learning, 32.

Robbins, H. and S. Monro (1951). “A stochastic approximation method,” Annals of Mathematical Statistics, 22, 400-407.

Rossi, P. E., R. E. McCulloch and G. M. Allenby (1996). “The value of purchase history data in target marketing,” Marketing Science, 15(4), 321-340.

Spall, J. (2003). Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, John Wiley and Sons.

Tan, L. S. L. (2015). “Stochastic variational inference for large-scale discrete choice models using adaptive batch sizes,” Statistics and Computing, 1-21.

Tirunillai, S. and G. J. Tellis (2014). “Mining marketing meaning from online chatter: strategic brand analysis of big data using latent Dirichlet allocation,” Journal of Marketing Research, 51(4), 463-479.

Toulis, P. and E. M. Airoldi (2016). “Implicit stochastic gradient descent,” Annals of Statistics, forthcoming.

Ying, Y., F. Feinberg and M. Wedel (2006). “Leveraging missing ratings to improve online recommendation systems,” Journal of Marketing Research, 43(3), 355-365.

Zhang, J. and L. Krishnamurthi (2004). “Customizing promotions in online stores,” Marketing Science, 23(4), 561-578.


Appendices

A Stochastic VB with Adaptive Mini-batch Sizes

To adaptively vary the mini-batch size and achieve faster convergence, we start the stochastic VB by sampling small batches of the data, and increase the batch size for a VB iteration only if information from the current batch no longer suffices to move the optimization in the appropriate direction. We follow the “ratio of progress and path” (RPP) strategy in Tan (2015) to determine whether to keep or to change the batch size. RPP indicates the extent to which the parameter values have moved monotonically towards the optimum over the past M iterations, as opposed to merely bouncing back and forth around the optimum. In iteration t, RPP is calculated as (Gaivoronski 1988)

\mathrm{RPP} = \frac{\left|\theta^{(t-M)} - \theta^{(t)}\right|}{\sum_{r=t-M}^{t-1}\left|\theta^{(r)} - \theta^{(r+1)}\right|},   (8)

where θ(t) is the scalar component of the population level parameter vector in iteration t. The RPP is a measure bounded between zero and one. It equals zero if θ(t−M) = θ(t), a sign that no real progress was made after M iterations. It equals one if the path from θ(t−M) to θ(t) is perfectly monotonic. Between zero and one, a small RPP implies that the optimization process exhibits considerable oscillating behavior, so more resolution and information are needed from the data, whereas a large RPP signals relatively monotonic progress towards the optimum. In iteration t, if the RPP falls below a pre-specified threshold ν, we increase the batch size by a fixed small percentage. When the mini-batch reaches the full data set, we continue the optimization by switching to the non-stochastic VB until convergence. Using the RPP indicator, we essentially allow the estimation procedure to determine by itself the most appropriate batch size to use during the process of optimization.
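A minimal sketch of this check on a stored history of the tracked scalar (names are hypothetical, and the zero-path case is treated as converged):

def rpp(theta_history, M):
    # Ratio of progress and path (equation 8) over the last M iterations.
    window = theta_history[-(M + 1):]    # theta^(t-M), ..., theta^(t)
    path = sum(abs(window[r + 1] - window[r]) for r in range(len(window) - 1))
    return abs(window[-1] - window[0]) / path if path > 0 else 1.0

A batch-size controller would then grow |S(t)| by a fixed small percentage whenever the returned ratio falls below the threshold ν.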

In applying adaptive mini-batch sizes to the CGSTM model, we take a random sample S(t) from the total of I individuals in iteration t, and calculate the RPP at the end of the iteration (until the mini-batch reaches the full data). For each sampled individual, the updating of their variational parameters, µq(γi) and Σq(γi), remains the same, whereas the updating of population level variational parameters such as µq(β) and Rq(Γ) changes from

\mu_{q(\beta)} = \Sigma_{q(\beta)}\Big(\Sigma_\beta^{-1}\mu_\beta + \rho_{q(\Gamma)} R_{q(\Gamma)}^{-1}\sum_{i=1}^{I}\mu_{q(\gamma_i)}\Big)

R_{q(\Gamma)} = R_\Gamma + I\,\Sigma_{q(\beta)} + \sum_{i=1}^{I}\Big[(\mu_{q(\gamma_i)} - \mu_{q(\beta)})(\mu_{q(\gamma_i)} - \mu_{q(\beta)})' + \Sigma_{q(\gamma_i)}\Big],

to the following

\mu_{q(\beta)} = (1-\pi)\,\mu_{q(\beta)}^{(t-1)} + \pi\,\Sigma_{q(\beta)}\Big(\Sigma_\beta^{-1}\mu_\beta + \rho_{q(\Gamma)} R_{q(\Gamma)}^{-1}\,\frac{I}{|S^{(t)}|}\sum_{i\in S^{(t)}}\mu_{q(\gamma_i)}\Big)

R_{q(\Gamma)} = (1-\pi)\,R_{q(\Gamma)}^{(t-1)} + \pi\Big\{R_\Gamma + I\,\Sigma_{q(\beta)} + \frac{I}{|S^{(t)}|}\sum_{i\in S^{(t)}}\Big[(\mu_{q(\gamma_i)} - \mu_{q(\beta)})(\mu_{q(\gamma_i)} - \mu_{q(\beta)})' + \Sigma_{q(\gamma_i)}\Big]\Big\}.

In the above, the exponentially weighted average between the current and the previous iteration, with weight π, ensures convergence of the stochastic optimization (Hoffman et al. 2013). The multiplier I/|S(t)| rescales the intermediate estimate as if the entire data set (i.e., all individuals) were used for the current update. The non-stochastic version of VB is recovered when π = 1 and |S(t)| = I, i.e., when the mini-batch includes all individuals in the original dataset. To move the parameter estimates quickly towards the optimum, we further allow the RPP threshold ν and the weight π to vary linearly with the mini-batch size across iterations, reflecting the higher tolerance and lower precision we are willing to accept in early VB iterations.
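The stochastic updates above (and those in Appendix B) share one template: blend last iteration’s value with a rescaled mini-batch estimate. A generic sketch with hypothetical names, where scale = I/|S(t)| and base collects the prior or constant terms of a given update:

def stochastic_update(prev, base, batch_stat, pi_t, scale):
    # One weighted step: (1 - pi) * previous value + pi * (base + rescaled batch sum).
    # Setting pi_t = 1 with a full batch recovers the non-stochastic VB update.
    return (1.0 - pi_t) * prev + pi_t * (base + scale * batch_stat)

The same function applies whether the quantity being updated is a scalar, a vector such as µq(β), or a matrix such as Rq(Γ).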

B Stochastic VB Updates for the CGSTM Model

We estimate the CGSTM model via stochastic VB with adaptive mini-batch sizes. In each iteration, we sample a subset of users S(t) using the RPP strategy discussed in Appendix A. Denote the documents rated by the sampled individuals in the current iteration as Ds, where Ds = ∪i∈S(t) Di. We first update the variational parameters associated with document d, ∀d ∈ Ds. Let ud = log αd, so that ud ∼ N(x′dν, δ²). In particular, the ADSVB method detailed in Ansari, Li and Zhang (2016) is applied to update the variational parameters in q(ud) = N(µq(ud), σ²q(ud)). We derive the gradient for ADSVB as

\frac{\partial \log p(\omega_d, u_d)}{\partial u_d} = K e^{u_d}\big(\Psi(K e^{u_d}) - \Psi(e^{u_d})\big) + e^{u_d}\sum_{k=1}^{K}\Big(\Psi(\zeta_{dk}) - \Psi\Big(\sum_{k=1}^{K}\zeta_{dk}\Big)\Big) - \frac{a_{q(\delta^2)}}{b_{q(\delta^2)}}\big(u_d - x_d'\mu_{q(\nu)}\big),   (9)

where Ψ(·) is the digamma function.

Within document d, the variational parameters for the topic proportions ωd, ∀d ∈ Ds, are updated by

\zeta_d = \exp\left\{\mu_{q(u_d)} + 0.5\,\sigma^2_{q(u_d)}\right\} + \sum_{n=1}^{N_d}\phi_{dn}.   (10)

The variational parameters for the topic assignment of word wdn are calculated by

\phi_{dn} \propto \tau_n \exp\Big\{\Psi(\zeta_d) + \frac{1}{N_d}\frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}}\sum_{i\in I_d} y_{di}\,\mu_{q(\gamma_i)} - \frac{1}{2N_d^2}\frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}}\sum_{i\in I_d}\Big[2\big(\Sigma_{q(\gamma_i)} + \mu_{q(\gamma_i)}\mu_{q(\gamma_i)}'\big)\phi_{d,-n} + \mu_{q(\gamma_i)}^2 + \mathrm{diag}(\Sigma_{q(\gamma_i)})\Big]\Big\},   (11)

where \phi_{d,-n} = \sum_{j\neq n}\phi_{dj}, and Id denotes the sampled individuals who rated document d. Then,


we update the variational parameters of the individual coefficients γi for every sampled user:

\Sigma_{q(\gamma_i)}^{-1} = \rho_{q(\Gamma)} R_{q(\Gamma)}^{-1} + \frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}}\,E(A_i'A_i)   (12)

\mu_{q(\gamma_i)} = \Sigma_{q(\gamma_i)}\Big[\rho_{q(\Gamma)} R_{q(\Gamma)}^{-1}\mu_{q(\beta)} + \frac{a_{q(\sigma^2)}}{b_{q(\sigma^2)}}\,E(A_i)' y_i\Big],   (13)

where E(A_i) is a |D_i| \times K matrix with the vectors \phi_d on its rows, and E(A_i'A_i) = \sum_{d\in D_i}\frac{1}{N_d^2}\big[\sum_{m\neq n}\phi_{dn}\phi_{dm}' + \mathrm{diag}(\phi_{dn})\big].

Next, we update the population level parameters using the weighted averages discussed in Appendix A. Specifically, the topic distribution is given by

\tau_{kw} \propto (1-\pi)\,\tau_{kw}^{(t-1)} + \pi\sum_{d=1}^{D}\sum_{n=1}^{N_d}\mathbb{1}(w_{dn} = w)\,\phi_{dnk}.   (14)
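As one concrete instance of the blending template, the sketch below implements equation (14) up to the normalization implied by the proportionality sign; phi and words are hypothetical per-document arrays of word-topic responsibilities and word ids:

import numpy as np

def update_tau(tau_prev, phi, words, V, pi_t):
    # Equation (14): blend the previous topic-word weights with the expected
    # word-topic counts 1(w_dn = w) * phi_dnk accumulated over documents.
    K = tau_prev.shape[0]
    counts = np.zeros((K, V))
    for phi_d, w_d in zip(phi, words):       # one entry per document d
        for n, w in enumerate(w_d):
            counts[:, w] += phi_d[n]         # phi_d[n] is the K-vector phi_dn
    return (1.0 - pi_t) * tau_prev + pi_t * counts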

The variational parameters for the population covariance matrix Γ are updated by

\rho_{q(\Gamma)} = \rho_\Gamma + I   (15)

R_{q(\Gamma)} = (1-\pi)\,R_{q(\Gamma)}^{(t-1)} + \pi\Big\{R_\Gamma + I\,\Sigma_{q(\beta)} + \frac{I}{|S^{(t)}|}\sum_{i\in S^{(t)}}\Big[(\mu_{q(\beta)} - \mu_{q(\gamma_i)})(\mu_{q(\beta)} - \mu_{q(\gamma_i)})' + \Sigma_{q(\gamma_i)}\Big]\Big\}.   (16)

The variational parameters for the population mean β are given by

\Sigma_{q(\beta)}^{-1} = \Sigma_\beta^{-1} + I\,\rho_{q(\Gamma)} R_{q(\Gamma)}^{-1}   (17)

\mu_{q(\beta)} = (1-\pi)\,\mu_{q(\beta)}^{(t-1)} + \pi\,\Sigma_{q(\beta)}\Big[\Sigma_\beta^{-1}\mu_\beta + \rho_{q(\Gamma)} R_{q(\Gamma)}^{-1}\,\frac{I}{|S^{(t)}|}\sum_{i\in S^{(t)}}\mu_{q(\gamma_i)}\Big].   (18)

The variational parameters for the error variance σ² are calculated as

a_{q(\sigma^2)} = a_{\sigma^2} + \frac{1}{2}\sum_{d=1}^{D} I_d   (19)

b_{q(\sigma^2)} = (1-\pi)\,b_{q(\sigma^2)}^{(t-1)} + \pi\Big\{b_{\sigma^2} + \frac{1}{2}\,\frac{I}{|S^{(t)}|}\sum_{i\in S^{(t)}}\Big(\|y_i - E(A_i)\mu_{q(\gamma_i)}\|^2 + \mathrm{Tr}\big[E(A_i'A_i)\Sigma_{q(\gamma_i)}\big]\Big)\Big\},   (20)

where Id denotes the total number of individuals who rated document d.

The variational parameters for the aggregate attribute coefficients ν are given by

\Sigma_{q(\nu)}^{-1} = \Sigma_\nu^{-1} + \frac{a_{q(\delta^2)}}{b_{q(\delta^2)}}\,X'X   (21)

\mu_{q(\nu)} = (1-\pi)\,\mu_{q(\nu)}^{(t-1)} + \pi\,\Sigma_{q(\nu)}\Big[\Sigma_\nu^{-1}\mu_\nu + \frac{I}{|S^{(t)}|}\,\frac{a_{q(\delta^2)}}{b_{q(\delta^2)}}\,X_{D_s}'\{\mu_{q(u_d)}\}_{d\in D_s}\Big].   (22)

Lastly, the variational parameters for the error variance δ² are calculated as

a_{q(\delta^2)} = a_{\delta^2} + \frac{D}{2}   (23)

b_{q(\delta^2)} = (1-\pi)\,b_{q(\delta^2)}^{(t-1)} + \pi\Big[b_{\delta^2} + \frac{1}{2}\,\frac{I}{|S^{(t)}|}\sum_{d\in D_s}\Big((\mu_{q(u_d)} - x_d'\mu_{q(\nu)})^2 + x_d'\Sigma_{q(\nu)}x_d + \sigma^2_{q(u_d)}\Big)\Big].   (24)
