journal club: meta-prod2vec

Meta-Prod2Vec - Product Embeddings Using Side-Information for Recommendation

Yuya Kanemoto

Vasile F et al. RecSys 2016

Neural embedding: Word2Vec (Skip-gram)

• A method for learning distributed vector representations that capture a large number of syntactic and semantic word relationships

• Example: Tokyo - Japan + Germany = Berlin

• Word2Vec is essentially a two-layer neural network

• Objective function:

Mikolov T et al. 2013
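
For reference, the Skip-gram objective in Mikolov et al. 2013 can be written as below. Here T is the number of words in the training corpus, c the context window size, W the vocabulary size, and v_w and v'_w the input and output vectors of word w. The objective is maximised.

```latex
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W}\exp\left({v'_{w}}^{\top} v_{w_I}\right)}
```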

Skip-gram with negative sampling

• The denominator of the conditional probability (softmax) sums over the whole vocabulary, so evaluating it at every SGD step is too slow for large data sets

• Negative sampling replaces it with a simpler task: distinguish each observed target word co-occurrence from k randomly drawn negative samples

Mikolov T et al. 2013

: Objective function

: Objective function with negative sampling
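
In Mikolov et al. 2013, negative sampling replaces each log p(w_O | w_I) term of the Skip-gram objective with the expression below. Here σ is the logistic sigmoid, v and v' are the input and output vectors, k is the number of negative samples, and P_n(w) is the noise distribution the negatives are drawn from.

```latex
\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right)
+ \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]
```

The sum over the vocabulary is gone: each update touches only the observed pair and k sampled words.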

Embedding and Matrix Factorisation

• The objective of the embedding is closely related to matrix factorisation

• Embedding can be considered as decomposition of SPMI (shifted pointwise mutual information) matrix

Levy O et al. 2014
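
The link to matrix factorisation can be illustrated with a small sketch. Levy and Goldberg show that Skip-gram with k negative samples implicitly factorises the matrix with cells PMI(w, c) - log k. The code below builds that matrix from a toy co-occurrence count matrix and factorises its positive part (SPPMI) with SVD. The function names, the toy counts and the symmetric split of the singular values are assumptions made for illustration, and SVD is a stand-in for the SGD training that Word2Vec actually performs.

```python
import numpy as np


def spmi_matrix(counts, k=1):
    """Shifted PMI: PMI(w, c) - log(k) from a word-context count matrix.

    Assumes every row and column has at least one non-zero count.
    Cells with a zero count come out as -inf.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)   # word marginals
    col = counts.sum(axis=0, keepdims=True)   # context marginals
    with np.errstate(divide="ignore"):
        pmi = np.log(counts * total / (row * col))
    return pmi - np.log(k)


def sppmi_embeddings(counts, k=1, dim=2):
    """Factorise the positive part of the shifted PMI matrix with SVD."""
    # Clipping at zero turns the -inf cells into 0 (shifted positive PMI).
    sppmi = np.maximum(spmi_matrix(counts, k), 0.0)
    u, s, vt = np.linalg.svd(sppmi, full_matrices=False)
    # Split the singular values evenly between word and context vectors.
    word_vecs = u[:, :dim] * np.sqrt(s[:dim])
    ctx_vecs = vt[:dim, :].T * np.sqrt(s[:dim])
    return word_vecs, ctx_vecs


# Toy counts (hypothetical): rows are words, columns are contexts.
toy_counts = np.array([[4, 0, 1],
                       [0, 3, 1],
                       [1, 1, 2]])
word_vecs, ctx_vecs = sppmi_embeddings(toy_counts, k=1, dim=3)
```

With dim equal to the full rank, word_vecs @ ctx_vecs.T reproduces the SPPMI matrix; a smaller dim gives a low-rank approximation, which is the embedding.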

Neural embedding: Prod2Vec

• A method applying Skip-gram model for product recommendation

• When a user buys a product, products with similar vector representations are recommended

Grbovic M et al. 2015
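
The recommendation step can be sketched as a nearest-neighbour lookup in embedding space. The snippet below assumes the product vectors have already been learned; the product ids, the toy two-dimensional vectors and the function name are hypothetical, and cosine similarity is used as one common choice of similarity.

```python
import numpy as np


def recommend_similar(product_id, embeddings, top_k=2):
    """Return the top_k products closest to product_id by cosine similarity."""
    ids = list(embeddings)
    mat = np.array([embeddings[i] for i in ids], dtype=float)
    # Normalise rows so that a dot product equals cosine similarity.
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)
    query = mat[ids.index(product_id)]
    scores = mat @ query
    # Sort by descending similarity and drop the query product itself.
    ranked = [ids[i] for i in np.argsort(-scores) if ids[i] != product_id]
    return ranked[:top_k]


# Hypothetical embeddings, for illustration only.
toy_embeddings = {
    "song_a": [1.0, 0.1],
    "song_b": [0.9, 0.2],
    "song_c": [-0.2, 1.0],
    "song_d": [0.0, 0.9],
}
similar_to_a = recommend_similar("song_a", toy_embeddings, top_k=2)
```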

Prod2Vec for popular songs

“Shake It Off” “All About That Bass”

Vasile F et al. 2016

Prod2Vec in cold start case

“You’re Not Sorry” “Du Hast”


Meta-Prod2Vec constraints

• Meta-Prod2Vec = Prod2Vec + product meta-data

• The aim is to deal with cold start problems


Loss function of Prod2Vec


Negative sampling for Meta-Prod2Vec


Loss function of Meta-Prod2Vec


I: input, J: output, M: meta-data
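
With this notation, the Meta-Prod2Vec loss in Vasile et al. 2016 can be sketched as the Prod2Vec loss plus four side-information terms weighted by λ. Each L_{A|B} is the loss for predicting elements of space A from elements of space B. The one-line readings below are a paraphrase, not the paper's wording.

```latex
L_{MP2V} = L_{J|I} + \lambda \left( L_{M|I} + L_{J|M} + L_{M|M} + L_{I|M} \right)
% L_{J|I}: surrounding products given the input product (the Prod2Vec loss)
% L_{M|I}: meta-data of surrounding products given the input product
% L_{J|M}: surrounding products given the input product's meta-data
% L_{M|M}: meta-data of surrounding products given the input product's meta-data
% L_{I|M}: the input product given its own meta-data
```

Setting λ = 0 recovers plain Prod2Vec.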

Evaluation of experiments


• Hit ratio at K (HR@K): whether the test product appears in the top-K list of recommended products, regardless of its rank within that list

• Normalised discounted cumulative gain (NDCG@K): measures ranking quality based on the graded relevance of the recommended entities, discounted by position. It ranges from 0 to 1, with 1 representing the ideal ranking of the entities.

• IDCG: the maximum possible (ideal) DCG for a given set of queries

• rel_i: graded relevance of the result at position i

• k: maximum number of entities that can be recommended
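
Both metrics are short to compute. The sketch below assumes a single test product per recommendation list and uses the (2^rel - 1) / log2(i + 1) form of DCG, which is one common definition; with binary relevance it gives the same values as the rel / log2(i + 1) form. The function names and the toy list are hypothetical.

```python
import math


def hit_ratio_at_k(recommended, test_item, k):
    """1.0 if the test item is among the first k recommendations, else 0.0."""
    return 1.0 if test_item in recommended[:k] else 0.0


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (1-indexed)."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Toy example: the test product "c" is ranked third.
recommended = ["a", "b", "c", "d"]
relevances = [1 if item == "c" else 0 for item in recommended]
hr_at_2 = hit_ratio_at_k(recommended, "c", k=2)
hr_at_3 = hit_ratio_at_k(recommended, "c", k=3)
ndcg_at_3 = ndcg_at_k(relevances, k=3)
```

HR@3 counts the hit in full, while NDCG@3 credits it only partially because the product is at rank 3 instead of rank 1.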

Methods for comparison


• BestOf: based on popularity

• CoCounts: based on cosine similarity of product co-occurrence count vectors (basic collaborative filtering)

• Prod2Vec

• Meta-Prod2Vec

• Mix(Prod2Vec,CoCounts): α × Prod2Vec + (1 − α) × CoCounts

• Mix(Meta-Prod2Vec,CoCounts): α × Meta-Prod2Vec + (1 − α) × CoCounts
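
The Mix models are linear blends of two scorers, with the embedding model weighted by α and CoCounts by 1 − α. A minimal sketch follows, assuming both models produce per-product scores on a comparable scale and that a product missing from one model scores 0 there. Both are assumptions made for illustration, as are the function name and the toy scores.

```python
def mix_scores(embedding_scores, cocount_scores, alpha=0.15):
    """Blend two score dictionaries: alpha * embedding + (1 - alpha) * cocounts."""
    items = set(embedding_scores) | set(cocount_scores)
    return {
        item: alpha * embedding_scores.get(item, 0.0)
        + (1 - alpha) * cocount_scores.get(item, 0.0)
        for item in items
    }


# Hypothetical scores for two candidate products.
blended = mix_scores({"x": 1.0, "y": 0.0}, {"x": 0.0, "y": 1.0}, alpha=0.15)
ranking = sorted(blended, key=blended.get, reverse=True)
```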

Parameters

• Number of songs: 433k

• Number of artists: 67k

• Embedding dimension: 50

• Context window size: 3

• λ: 1

• α: 0.15

Relative importance of meta data


Improvement in cold start


Cold start

Improvement in cold start


Better performance in ensemble model


Discussion

• Meta-data was informative, especially in the cold-start case

• The ensemble method (with 15% Meta-Prod2Vec) worked well

• No comparison with matrix factorisation methods or with other Word2Vec variants that utilise meta-data
