Introduction to R Package Recommendation System Competition
John Myles White's introduction to the R Package Recommendation System Competition.
R Recommendation System Contest
John Myles White
March 10, 2011
John Myles White R Recommendation System Contest
Kaggle
Kaggle is a platform for data prediction competitions that allows organizations to post their data and have it scrutinized by the world’s best data scientists.
Kaggle Features
Kaggle provides every contest with:
- Centralized data downloads
- Public and private leaderboards using RMSE, AUC and other metrics
- Public discussion forums for participants to use
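The two leaderboard metrics named above can be computed in a few lines of base R. This is a minimal sketch, not Kaggle's own scoring code: the function names `rmse` and `auc` and the toy vectors are assumptions made for illustration. The AUC uses the rank-sum identity, i.e. the fraction of positive/negative pairs in which the positive case gets the higher score.

```r
# Root mean squared error between observed values and predictions.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# AUC via the rank-sum (Mann-Whitney) identity.
auc <- function(labels, scores) {
  ranks <- rank(scores)            # average ranks handle tied scores
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(ranks[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Toy data, invented for illustration only.
labels <- c(1, 0, 1, 1, 0, 0)
scores <- c(0.9, 0.2, 0.7, 0.4, 0.5, 0.1)

rmse.value <- rmse(labels, scores)
auc.value  <- auc(labels, scores)
```

In this toy example 8 of the 9 positive/negative pairs are ordered correctly, so `auc.value` is 8/9. Lower is better for RMSE; higher is better for AUC, with 0.5 corresponding to random ordering.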
Recent Kaggle Contests
- Tourism Forecasting
- Chess Ratings: Elo versus the Rest of the World
- INFORMS 2010: Short Term Stock Price Movements
Current and Upcoming Kaggle Contests
- Arabic Writer Identification
- Don’t Overfit: Dealing with Many Variables and Few Observations
- Heritage Health Prize
Advice on Running Kaggle Contests
- Stay involved: respond to forum posts quickly and make the contest seem alive
- Don’t use a prediction task where near perfect accuracy can be achieved
Mistakes We Made
- Netflix Prize: 0.8616 RMSE
- R Recommendation Contest: 0.9882 AUC
The R Recommendation System Contest
- Contestants must be able to predict whether a user U will have a package P installed on their system
Full Data Set
- Outcomes: List of all packages installed on 52 R users’ systems
- Predictors: Metadata about 2485 CRAN packages
Metadata
- Dependencies
- Suggests
- Imports
- Views
- Core
- Recommended
- Maintainer
- Maintainer’s Package Count
Training Data / Test Data Split
- Uniform random split over rows in full data set
- Training Set: 99373 rows
- Test Set: 33125 rows
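A uniform random split over rows can be sketched as follows. The row counts above are in a 3:1 ratio, so this sketch draws 75% of the rows for training. The data frame `full.data`, its column values, and the seed are hypothetical stand-ins, not the contest files.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the full data set: one row per (user, package) pair.
full.data <- data.frame(
  User      = rep(1:4, each = 5),
  Package   = rep(paste0('pkg', 1:5), times = 4),
  Installed = rbinom(20, size = 1, prob = 0.3)
)

# Draw 75% of the row indices uniformly at random, without replacement.
n.rows     <- nrow(full.data)
train.rows <- sample(n.rows, size = floor(0.75 * n.rows))

training.data <- full.data[train.rows, ]
test.data     <- full.data[-train.rows, ]
```

Because the split is over rows rather than over users, the same user generally appears in both the training and the test set.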
Additional Metadata
- LDA topic assignments for CRAN packages
- Used 25 topics
- Used all documentation: manuals, vignettes, etc.
Example Models
1. Package Metadata
2. Package Metadata + Per User Intercepts
3. Package Metadata + Per User Intercepts + Package Topic Assignments
Example Model 1
library('ProjectTemplate')
try(load.project())

logit.fit <- glm(Installed ~ LogDependencyCount + LogSuggestionCount +
                   LogImportCount + LogViewsIncluding +
                   LogPackagesMaintaining + CorePackage + RecommendedPackage,
                 data = training.data,
                 family = binomial(link = 'logit'))
Example Model 2
logit.fit <- glm(Installed ~ LogDependencyCount + LogSuggestionCount +
                   LogImportCount + LogViewsIncluding +
                   LogPackagesMaintaining + CorePackage + RecommendedPackage +
                   factor(User),
                 data = training.data,
                 family = binomial(link = 'logit'))
Example Model 3
logit.fit <- glm(Installed ~ LogDependencyCount + LogSuggestionCount +
                   LogImportCount + LogViewsIncluding +
                   LogPackagesMaintaining + CorePackage + RecommendedPackage +
                   factor(User) + Topic,
                 data = training.data,
                 family = binomial(link = 'logit'))
Model Performance
- Model 1: ~0.80 AUC
- Model 2: ~0.95 AUC
- Model 3: > 0.95 AUC
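One way to see why per-user intercepts can matter is to compare two logistic regressions on simulated data and score both by held-out AUC. This is a sketch under stated assumptions: the data are simulated, `Feature` is a single made-up package-level predictor standing in for the metadata columns, and the resulting numbers say nothing about the contest data itself.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated stand-in data: 30 hypothetical users x 40 hypothetical packages.
n.users <- 30
n.pkgs  <- 40
sim <- expand.grid(User = 1:n.users, Package = 1:n.pkgs)

user.effect <- rnorm(n.users, sd = 1)   # some users install more packages overall
pkg.feature <- rnorm(n.pkgs)            # one made-up package-level predictor
sim$Feature <- pkg.feature[sim$Package]

log.odds      <- -0.5 + sim$Feature + user.effect[sim$User]
sim$Installed <- rbinom(nrow(sim), size = 1, prob = plogis(log.odds))

# Uniform random split over rows, 75% for training.
train.rows <- sample(nrow(sim), size = floor(0.75 * nrow(sim)))
train <- sim[train.rows, ]
test  <- sim[-train.rows, ]

# Package predictor only, then the same model plus per-user intercepts.
fit.1 <- glm(Installed ~ Feature, data = train,
             family = binomial(link = 'logit'))
fit.2 <- glm(Installed ~ Feature + factor(User), data = train,
             family = binomial(link = 'logit'))

# AUC via the rank-sum identity.
auc <- function(labels, scores) {
  ranks <- rank(scores)
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(ranks[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

auc.1 <- auc(test$Installed, predict(fit.1, newdata = test, type = 'response'))
auc.2 <- auc(test$Installed, predict(fit.2, newdata = test, type = 'response'))
```

Since the simulation builds in a user-level effect, one would expect `auc.2` to come out above `auc.1` in most runs, which mirrors the direction of the jump from Model 1 to Model 2 reported above.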
Unexploited Structure in Data
Future Work
What makes a package useful?
- Need subjective ratings
- Some packages are only installed because they’re dependencies for other popular packages
Future Work
Get a better data sample:
- Contest only used data from 52 users
- But we do have complete data for those users
- But data was not a random sample of R users
Future Work
- Do more with LDA to categorize R packages
- Prediction task allows us to evaluate “quality” of topic count and topic assignments
Future Work
- Build up various package-package similarity matrices for conditional recommendations
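One possible package-package similarity matrix can be built from co-installation counts. The sketch below uses Jaccard similarity, which is a choice made here for illustration and not something the slides specify; the install matrix, user names, and package names are all hypothetical.

```r
# Hypothetical binary install matrix: rows are users, columns are packages.
installs <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    0, 1, 1, 0,
    1, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0('user', 1:4), c('pkgA', 'pkgB', 'pkgC', 'pkgD'))
)

# Co-installation counts: entry (i, j) is the number of users with both packages.
co.counts <- crossprod(installs)

# Jaccard similarity: users with both packages / users with either package.
n.installed <- diag(co.counts)
either      <- outer(n.installed, n.installed, '+') - co.counts
jaccard     <- co.counts / either

# Conditional recommendation: packages most similar to pkgA, excluding itself.
similar.to.A <- sort(jaccard['pkgA', colnames(jaccard) != 'pkgA'],
                     decreasing = TRUE)
```

For a user who already has `pkgA`, the top entry of `similar.to.A` is `pkgB` (similarity 0.5), so it would be recommended first. A package that nobody has installed would give a 0/0 entry, so real data would need that case handled.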
Future Work
- Can we understand the clustering in the network structure graph?
Resources
For more information, see
- The original Dataists’ contest announcement
- GitHub project page