[ADMA 2017] Identification of Grey Sheep Users by Histogram Intersection in Recommender Systems
Yong Zheng, Mayur Agnani, Mili Singh
Illinois Institute of Technology, Chicago, IL 60616, USA
2017 International Conference on Advanced Data Mining and Applications, Singapore, Nov 5-6, 2017
Identification of Grey Sheep Users By Histogram Intersection In Recommender Systems
Agenda
• Background: Recommender Systems
• Grey Sheep Users In Collaborative Filtering
• Methodology and Solutions
• Experimental Results
• Conclusions and Future Work
Traditional Recommendation Algorithms
Content-Based Recommendation Algorithms
The user will be recommended items similar to the ones the user preferred in the past, e.g., book or movie recommender systems.
Collaborative Filtering Based Recommendation Algorithms
The user will be recommended items that people with similar tastes and preferences liked in the past, e.g., movie recommender systems.
Hybrid Recommendation Algorithms
Combine content-based and collaborative filtering based algorithms to produce item recommendations.
Collaborative Filtering: Algorithms
User-Based KNN Collaborative Filtering (UBCF)
Assumption: a user u's rating on item t is similar to the ratings that similar users gave to item t; this group of similar users is called the user's K-nearest neighbors.
      Pirates of the Caribbean 4   Kung Fu Panda 2   Harry Potter 6   Harry Potter 7
U1    4                            4                 1                2
U2    3                            4                 2                1
U3    2                            2                 4                4
U4    4                            4                 1                ?
Collaborative Filtering: Algorithms
User-Based KNN Collaborative Filtering (UBCF)
a = the target user
i = the target item
N = user neighborhood
u = a user neighbor in N
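The prediction formula that this notation belongs to did not survive in the transcript. As a rough illustration only, the sketch below assumes the commonly used mean-centered weighted average over the neighborhood N, with cosine similarity on co-rated items as the weight. The function names, the shortened item names, and the choice of k = 2 are assumptions made for this example, not details taken from the slides.

```python
from math import sqrt

# Ratings from the example table; U4's rating for "Harry Potter 7" is unknown.
ratings = {
    "U1": {"Pirates 4": 4, "Panda 2": 4, "HP 6": 1, "HP 7": 2},
    "U2": {"Pirates 4": 3, "Panda 2": 4, "HP 6": 2, "HP 7": 1},
    "U3": {"Pirates 4": 2, "Panda 2": 2, "HP 6": 4, "HP 7": 4},
    "U4": {"Pirates 4": 4, "Panda 2": 4, "HP 6": 1},
}

def cosine(a, b):
    """Cosine similarity computed over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0  # no co-rated items, so no measurable similarity
    dot = sum(a[t] * b[t] for t in common)
    norm_a = sqrt(sum(a[t] ** 2 for t in common))
    norm_b = sqrt(sum(b[t] ** 2 for t in common))
    return dot / (norm_a * norm_b)

def mean_rating(user_ratings):
    return sum(user_ratings.values()) / len(user_ratings)

def predict(target, item, k=2):
    """Assumed form: mean(a) + sum(sim * (r_ui - mean(u))) / sum(|sim|) over N."""
    a = ratings[target]
    candidates = [(cosine(a, r), u) for u, r in ratings.items()
                  if u != target and item in r]
    neighbors = sorted(candidates, reverse=True)[:k]  # N: top-k most similar users
    num = sum(s * (ratings[u][item] - mean_rating(ratings[u])) for s, u in neighbors)
    den = sum(abs(s) for s, _ in neighbors)
    return mean_rating(a) if den == 0 else mean_rating(a) + num / den

prediction = predict("U4", "HP 7")
print(round(prediction, 2))
```

With these assumptions U1 and U2 form the neighborhood of U4, and the predicted rating lands a little below 2, in line with the low ratings both neighbors gave to that movie.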
Collaborative Filtering: Algorithms
Popular Challenges in Collaborative Filtering
Data sparsity problems
Cold-start users or items
Grey-sheep users
Incorporate content into collaborative filtering
….
Grey Sheep Users
Definition 1 by Mark Claypool, et al., 1999
A group of users who neither agree nor disagree with any group of users. Therefore, they do not benefit from user-based collaborative filtering.
Clustering Technique by Ghazanfar, et al., 2011
Distribution of User Ratings by Gras, et al., 2016
Definition 2 by John McCrae, et al., 2004
White Sheep Users may have high correlations with other users; Black Sheep Users have very few or no correlating users; Grey Sheep Users have unusual tastes and low correlations with others.
Distribution of User Similarities by Zheng, et al., 2017
Proposed Solution
Approach Based on The Distribution of User Similarities
White Sheep Users: high correlations with other users
Black Sheep Users: very few or no correlating users
Grey Sheep Users: unusual tastes, low correlations with others
The Distribution of User-User Correlations or Similarities
Proposed Solution
Proposed Solution
Step 1, represent each user as a distribution of user correlations
Step 2, select good and bad examples
Step 3, apply outlier detection on the selected examples. Grey sheep users are the intersection of the bad examples and the identified outliers
Step 4, examine the quality of the identified grey sheep users
Proposed Solution
Step 1, Distribution Representations
We calculate user-user correlations by cosine similarity
Obtain the descriptive statistics of the distribution
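A minimal sketch of Step 1, using the four-user example table from earlier in the deck. The slides only say "descriptive statistics", so the particular set computed here (mean, standard deviation, quartiles, skewness) is an assumption, as are the function names and the shortened item names.

```python
from math import sqrt
from statistics import mean, pstdev, quantiles

ratings = {
    "U1": {"Pirates 4": 4, "Panda 2": 4, "HP 6": 1, "HP 7": 2},
    "U2": {"Pirates 4": 3, "Panda 2": 4, "HP 6": 2, "HP 7": 1},
    "U3": {"Pirates 4": 2, "Panda 2": 2, "HP 6": 4, "HP 7": 4},
    "U4": {"Pirates 4": 4, "Panda 2": 4, "HP 6": 1},
}

def cosine(a, b):
    """Cosine similarity over co-rated items; 0.0 when there are none."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[t] * b[t] for t in common)
    norm_a = sqrt(sum(a[t] ** 2 for t in common))
    norm_b = sqrt(sum(b[t] ** 2 for t in common))
    return dot / (norm_a * norm_b)

def similarity_distribution(user, all_ratings):
    """Similarities between one user and every other user."""
    return [cosine(all_ratings[user], r)
            for other, r in all_ratings.items() if other != user]

def describe(sims):
    """Descriptive statistics of a similarity distribution (assumed feature set)."""
    m, sd = mean(sims), pstdev(sims)
    q1, q2, q3 = quantiles(sims, n=4)
    # Population skewness: negative = left-skewed, positive = right-skewed.
    skew = 0.0 if sd == 0 else sum((s - m) ** 3 for s in sims) / (len(sims) * sd ** 3)
    return {"mean": m, "std": sd, "q1": q1, "median": q2, "q3": q3, "skewness": skew}

sims_u4 = similarity_distribution("U4", ratings)
stats_u4 = describe(sims_u4)
print(stats_u4)
```

In this toy data U4 is highly similar to U1 and U2 and less similar to U3, so its distribution has a high mean and negative skewness.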
Proposed Solution
Step 2, Example Selection
Good examples: high correlations and left-skewed
Bad examples: low correlations and right-skewed
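The selection rule can be sketched as a simple filter over per-user statistics. The slides give the criteria only qualitatively, so the cutoffs `low_mean` and `high_mean`, the use of the sign of the skewness, and the sample statistics below are all hypothetical values chosen for illustration.

```python
def select_examples(stats_by_user, low_mean, high_mean):
    """Split users into good and bad examples.

    Good: high mean similarity and left-skewed (negative skewness).
    Bad:  low mean similarity and right-skewed (positive skewness).
    Users matching neither rule are left unselected.
    """
    good, bad = [], []
    for user, s in stats_by_user.items():
        if s["mean"] >= high_mean and s["skewness"] < 0:
            good.append(user)
        elif s["mean"] <= low_mean and s["skewness"] > 0:
            bad.append(user)
    return good, bad

# Made-up statistics for four users, for illustration only.
stats_by_user = {
    "u1": {"mean": 0.82, "skewness": -0.9},
    "u2": {"mean": 0.15, "skewness": 1.4},
    "u3": {"mean": 0.50, "skewness": 0.1},
    "u4": {"mean": 0.10, "skewness": -0.2},
}

good, bad = select_examples(stats_by_user, low_mean=0.2, high_mean=0.7)
print(good, bad)
```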
Proposed Solution
Step 3, Outlier Detection by Local Outlier Factor (LOF)
LOF helps identify outliers by the local density
Observations with LOF > 1 will be considered as outliers
We set different threshold values to find the optimal one for identifying grey sheep users, for example
LOF threshold = 1.0
LOF threshold = 1.1
LOF threshold = 1.2
LOF threshold = ….
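To make the thresholding concrete, here is a small pure-Python sketch of the standard LOF computation followed by a threshold sweep and the intersection with the bad examples. The user names, the 2-D feature vectors, the bad-example set, and k = 2 are hypothetical; the sketch also ignores distance ties and assumes no duplicated points.

```python
from math import dist

def lof_scores(points, k):
    """Local Outlier Factor for each point, using its k nearest neighbors."""
    n = len(points)
    # neighbors[i]: (distance, index) pairs of the k nearest other points.
    neighbors = []
    for i in range(n):
        ranked = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbors.append(ranked[:k])
    k_distance = [nb[-1][0] for nb in neighbors]

    def local_reachability_density(i):
        # reach-dist(i, j) = max(k-distance(j), d(i, j))
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF = average density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for _, j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical users and feature vectors (for example, distribution statistics).
users = ["u1", "u2", "u3", "u4", "u5"]
features = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
bad_examples = {"u4", "u5"}  # assumed output of the example selection step

scores = lof_scores(features, k=2)

grey_sheep_by_threshold = {}
for threshold in (1.0, 1.1, 1.2):
    outliers = {u for u, s in zip(users, scores) if s > threshold}
    # Grey sheep users: bad examples that are also flagged as outliers.
    grey_sheep_by_threshold[threshold] = sorted(outliers & bad_examples)

print(grey_sheep_by_threshold)
```

In this toy data the four clustered points score 1.0 and the isolated point scores well above it, so only u5 is flagged at every threshold. On real data the choice of threshold changes how many users are flagged, which is why several values are compared.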
Proposed Solution
Step 4, Examine the quality of identified GS Users
The parameters in our solution
Example Selection
LOF threshold
Number of neighbors in the LOF method
Our goals or examination criteria
To find as many GS users as possible
Recommendation by UBCF should be worse for GS users than non-GS users
Improved Approach
Drawback
Cosine similarities rely on co-ratings. If two users have not rated any items in common, we are not able to measure their similarity
Improved Approach
We represent each user as its similarity distribution
The distribution can be represented by a histogram
The intersection of two histograms tells the user-user similarity
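A minimal sketch of histogram intersection between two users' similarity distributions. The bin count, the [0, 1] value range, and the normalization to relative frequencies are assumptions made for this example; the slides do not state these details.

```python
def histogram(values, bins=10, lo=0.0, hi=1.0):
    """Normalized fixed-width histogram (relative frequencies per bin)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp so that v == hi falls in the last bin
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins

def histogram_intersection(h1, h2):
    """Sum of bin-wise minima; 1.0 for identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Hypothetical similarity distributions for two users.
sims_a = [0.05, 0.15, 0.15, 0.95]
sims_b = [0.05, 0.05, 0.15, 0.85]

similarity = histogram_intersection(histogram(sims_a), histogram(sims_b))
print(similarity)
```

Because each histogram is built from a user's similarities to all other users, the intersection can be computed for any pair of users, including pairs with no co-rated items.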
Experimental Setting
• Data: MovieLens 100K rating data
– 100K ratings
– 1K users
– 1.7K movies
– Each user has rated at least 20 movies
• Evaluation
– 80% as training, 20% as testing
– Mean absolute error (MAE) to evaluate rating predictions
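For reference, MAE is the average absolute difference between actual and predicted ratings. The sketch below computes it separately for two groups of users, which is how the criterion "UBCF should be worse for GS users than non-GS users" could be checked. All ratings shown are made-up values, not results from the experiments.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up test ratings and UBCF predictions for two hypothetical user groups.
non_gs_actual, non_gs_predicted = [4, 3, 5, 1], [3.5, 3.0, 4.0, 2.0]
gs_actual, gs_predicted = [5, 1, 4, 2], [3.0, 3.0, 2.5, 3.5]

mae_non_gs = mean_absolute_error(non_gs_actual, non_gs_predicted)
mae_gs = mean_absolute_error(gs_actual, gs_predicted)
print(mae_non_gs, mae_gs)
```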
Conclusions
• We develop a novel approach to identify GS users by utilizing the definition related to the user-user correlations
• We propose to use histogram intersection to better measure user-user similarities
• Our approach is demonstrated to work better than others based on the MovieLens 100K data
Future Work
• Try it on other data sets
• Seek approaches to improve the recommendation performance for the group of Grey Sheep Users