introduction to r package recommendation system competition

R Recommendation System Contest

John Myles White

March 10, 2011


Kaggle

Kaggle is a platform for data prediction competitions that allows organizations to post their data and have it scrutinized by the world’s best data scientists.


Kaggle Features

Kaggle provides every contest with:

- Centralized data downloads

- Public and private leaderboards using RMSE, AUC and other metrics

- Public discussion forums for participants to use
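
As a rough illustration of the two leaderboard metrics named above, the following base R sketch computes RMSE and AUC on toy vectors. The function names `rmse` and `auc` and all input values are invented for this example; this is not Kaggle's own scoring code.

```r
# Root mean squared error between true values and numeric predictions.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive is scored above a randomly chosen negative.
auc <- function(labels, scores) {
  ranks <- rank(scores)  # average ranks are used for tied scores
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(ranks[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Toy inputs, invented for illustration only.
ratings   <- c(3, 5, 4, 1)
predicted <- c(2.5, 5, 4, 2)
rmse.value <- rmse(ratings, predicted)

labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
auc.value <- auc(labels, scores)
```

Lower RMSE is better, while AUC runs from 0.5 (random ranking) up to 1 (perfect ranking).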


Recent Kaggle Contests

- Tourism Forecasting

- Chess Ratings: Elo versus the Rest of the World

- INFORMS 2010: Short Term Stock Price Movements


Current and Upcoming Kaggle Contests

- Arabic Writer Identification

- Don’t Overfit: Dealing with Many Variables and Few Observations

- Heritage Health Prize


Advice on Running Kaggle Contests

- Stay involved: respond to forum posts quickly and make the contest seem alive

- Don’t use a prediction task where near perfect accuracy can be achieved


Mistakes We Made

- Netflix Prize: 0.8616 RMSE

- R Recommendation Contest: 0.9882 AUC


The R Recommendation System Contest

- Contestants must be able to predict whether a user U will have a package P installed on their system


Full Data Set

- Outcomes: List of all packages installed on 52 R users’ systems

- Predictors: Metadata about 2485 CRAN packages
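
One way to picture the outcome data is as one row per (user, package) pair with a 0/1 `Installed` label. The sketch below builds such a table in base R from per-user package lists. The user names, the three-package universe and the object names `all.packages`, `installed` and `pairs` are toy assumptions, not the contest's actual files or preprocessing.

```r
# Hypothetical toy inputs: three packages and two users' installed lists.
all.packages <- c("ggplot2", "plyr", "zoo")
installed <- list(
  user1 = c("ggplot2", "plyr"),
  user2 = c("zoo")
)

# One row per (User, Package) pair.
pairs <- expand.grid(User = names(installed),
                     Package = all.packages,
                     stringsAsFactors = FALSE)

# Installed is 1 when the package appears in that user's list, else 0.
pairs$Installed <- as.integer(mapply(
  function(user, package) package %in% installed[[user]],
  pairs$User, pairs$Package
))
```

Package metadata could then be merged onto this table by package name to form the predictors.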


Metadata

- Dependencies

- Suggests

- Imports

- Views

- Core

- Recommended

- Maintainer

- Maintainer’s Package Count


Training Data / Test Data Split

- Uniform random split over rows in full data set

- Training Set: 99373 rows

- Test Set: 33125 rows
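
The row counts above work out to roughly a 75/25 split. A minimal sketch of a uniform random split over rows in base R follows. The `full.data` frame here is a simulated stand-in, and the seed and the 0.75 fraction are assumptions inferred from the counts, not the contest's actual script.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the full data set: 1000 rows.
full.data <- data.frame(Row = 1:1000,
                        Installed = rbinom(1000, 1, 0.1))

# Draw 75% of the row indices uniformly at random, without replacement.
n <- nrow(full.data)
training.rows <- sample(n, size = round(0.75 * n))

# Training rows are the sampled indices; test rows are everything else.
training.data <- full.data[training.rows, ]
test.data     <- full.data[-training.rows, ]
```

Because the split is over rows rather than over users, every user can appear in both the training set and the test set.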


Additional Metadata

- LDA topic assignments for CRAN packages

- Used 25 topics

- Used all documentation: manuals, vignettes, etc.


Example Models

1. Package Metadata

2. Package Metadata + Per User Intercepts

3. Package Metadata + Per User Intercepts + Package Topic Assignments


Example Model 1

library('ProjectTemplate')
try(load.project())  # load the project's data and helper code

# Logistic regression on package metadata only
logit.fit <- glm(Installed ~ LogDependencyCount +
                             LogSuggestionCount +
                             LogImportCount +
                             LogViewsIncluding +
                             LogPackagesMaintaining +
                             CorePackage +
                             RecommendedPackage,
                 data = training.data,
                 family = binomial(link = 'logit'))


Example Model 2

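# Package metadata plus one intercept per user
logit.fit <- glm(Installed ~ LogDependencyCount +
                             LogSuggestionCount +
                             LogImportCount +
                             LogViewsIncluding +
                             LogPackagesMaintaining +
                             CorePackage +
                             RecommendedPackage +
                             factor(User),
                 # remaining arguments as in Example Model 1
                 data = training.data,
                 family = binomial(link = 'logit'))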



Example Model 3

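# Package metadata, per-user intercepts and package topic assignments
logit.fit <- glm(Installed ~ LogDependencyCount +
                             LogSuggestionCount +
                             LogImportCount +
                             LogViewsIncluding +
                             LogPackagesMaintaining +
                             CorePackage +
                             RecommendedPackage +
                             factor(User) +
                             Topic,
                 # remaining arguments as in Example Model 1
                 data = training.data,
                 family = binomial(link = 'logit'))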



Model Performance

- Model 1: ∼ 0.80 AUC

- Model 2: ∼ 0.95 AUC

- Model 3: > 0.95 AUC
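
To show how AUC figures like these might be obtained, the sketch below fits a metadata-only model and a model with per-user intercepts on simulated data, then scores each on held-out rows. Everything here is an assumption made for illustration: the simulated users, packages and effect sizes, the single predictor, and the `auc` helper. Its output says nothing about the contest data itself.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated data, invented for illustration: 10 users x 200 packages.
n.users <- 10
n.packages <- 200
sim <- expand.grid(User = factor(1:n.users), Package = 1:n.packages)

# Hypothetical package-level predictor and per-user propensity to install.
log.dependency.count <- rnorm(n.packages)
user.effect <- rnorm(n.users, sd = 1.5)
sim$LogDependencyCount <- log.dependency.count[sim$Package]
linear.predictor <- -1 + sim$LogDependencyCount +
  user.effect[as.integer(sim$User)]
sim$Installed <- rbinom(nrow(sim), 1, plogis(linear.predictor))

# Uniform random split over rows (75% training, 25% test).
training.rows <- sample(nrow(sim), size = round(0.75 * nrow(sim)))
training.data <- sim[training.rows, ]
test.data <- sim[-training.rows, ]

# Metadata only, then metadata plus per-user intercepts.
fit.1 <- glm(Installed ~ LogDependencyCount,
             data = training.data,
             family = binomial(link = 'logit'))
fit.2 <- glm(Installed ~ LogDependencyCount + User,
             data = training.data,
             family = binomial(link = 'logit'))

# AUC via the rank-sum identity (helper defined for this sketch).
auc <- function(labels, scores) {
  ranks <- rank(scores)
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(ranks[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Score each model on the held-out rows.
auc.1 <- auc(test.data$Installed,
             predict(fit.1, newdata = test.data, type = 'response'))
auc.2 <- auc(test.data$Installed,
             predict(fit.2, newdata = test.data, type = 'response'))
```

When users differ a lot in how many packages they install, as simulated here, the per-user intercepts would be expected to raise the held-out AUC.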


Unexploited Structure in Data


Future Work

What makes a package useful?

- Need subjective ratings

- Some packages are only installed because they’re dependencies for other popular packages


Future Work

Get a better data sample:

- Contest only used data from 52 users

- But we do have complete data for those users

- But data was not a random sample of R users


Future Work

- Do more with LDA to categorize R packages

- The prediction task lets us evaluate the “quality” of the topic count and topic assignments


Future Work

- Build up various package-package similarity matrices for conditional recommendations
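
One possible package-package similarity matrix is cosine similarity over co-installation counts. The sketch below computes it in base R from a toy user-by-package install matrix and uses it for a simple conditional recommendation. The matrix values, the choice of cosine similarity and the `recommend` helper are assumptions for illustration, not the author's method.

```r
# Hypothetical binary install matrix: rows are users, columns are packages.
installs <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    0, 1, 1,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user", 1:4), c("ggplot2", "plyr", "zoo"))
)

# Co-installation counts: entry (i, j) is the number of users
# who have both package i and package j installed.
co.installs <- crossprod(installs)

# Cosine similarity between package columns.
norms <- sqrt(diag(co.installs))
cosine.similarity <- co.installs / outer(norms, norms)

# Conditional recommendation: given one package the user already has,
# rank the other packages by their similarity to it.
recommend <- function(package, similarity) {
  scores <- similarity[package, colnames(similarity) != package]
  names(sort(scores, decreasing = TRUE))
}

recommendations <- recommend("ggplot2", cosine.similarity)
```

Other similarity measures, such as Jaccard overlap or shared dependencies, would slot into the same structure by swapping out the matrix.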


Future Work

- Can we understand the clustering in the network structure graph?


Resources

For more information, see:

- The original Dataists’ contest announcement: http://www.dataists.com/2010/10/using-data-tools-to-find-data-tools-the-yo-dawg-of-data-hacking/

- GitHub project page: https://github.com/johnmyleswhite/r_recommendation_system
