The $1,000,000 Netflix Contest


Upload: shay

Post on 12-Jan-2016


Page 1: The $1,000,000 Netflix Contest

The $1,000,000 Netflix Contest is to develop a "ratings prediction program" that can beat Netflix's own program (called Cinematch) by 10% in predicting what ratings users gave to movies.

I.e., predict rating(M,U) where (M,U) ∈ QUALIFYING(MovieID, UserID).

Netflix uses Cinematch to decide which movies a user will probably like next (based on all past rating history). All ratings are "5-star" ratings (5 is highest. 1 is lowest. Caution: 0 means “did not rate”).

Unfortunately rating=0 does not mean that the user "disliked" that movie, but that it wasn't rated at all. Most “ratings” are 0. Therefore, the ratings data sets are NOT vector spaces!
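Because 0 marks a missing rating rather than a low one, any statistic over ratings has to skip the zeros. A minimal sketch, assuming a dense per-user vector in which 0 denotes "did not rate" (the function name mean_rating is illustrative and is not part of the contest code):

```cpp
#include <vector>

// Mean of the ratings a user actually gave. Entries equal to 0 mean
// "did not rate" and are skipped rather than averaged in.
// Returns 0.0 when the user has rated nothing.
double mean_rating(const std::vector<int>& ratings) {
    long sum = 0;
    int count = 0;
    for (int r : ratings) {
        if (r == 0) continue;  // missing, not a dislike
        sum += r;
        ++count;
    }
    return count == 0 ? 0.0 : static_cast<double>(sum) / count;
}
```

Averaging the zeros in would pull every mean toward 0, which is one way to see why the raw rating tuples do not behave like ordinary vectors.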

One can approach the Netflix contest problem as a data mining Classification or Prediction problem.

A "history of ratings by users to movies", TRAINING(MovieID, UserID, Rating, Date), is given with which to train your predictor, which will predict the ratings given to QUALIFYING movie-user pairs (Netflix knows the ratings given to Qualifying pairs, but you don't).

Since TRAINING is very large, Netflix also provides a "smaller, but representative subset" of TRAINING, PROBE(MovieID, UserID), which is about 2 orders of magnitude smaller than TRAINING.

Netflix gives 5 years to submit QUALIFYING predictions. That contest window is about 1/2 gone now.

A team can submit as many solutions as it wishes, at any time. Each October, Netflix gives $50,000 to the team at the top of the so-called Netflix Leaderboard. BellKor has won that twice.

Page 2: The $1,000,000 Netflix Contest

The Netflix Contest (USER versus MOVIE voting)

One can address the prediction or classification problem using several different "approaches".

USER VOTERs (approach 1): To predict the rating of a pair (M,U), we treat TRAINING as a vector space of user rating vectors. The users are the points in the vector space and the movies are its dimensions. Since there are 17,770 movies, each user is a tuple of 17,770 ratings if all movies are used as dimensions. That's too many dimensions! The first dimension pruning restricts to only those movies that U has rated (= supportU). We also allow another round of dimension pruning based on correlation with M.

Once the set of dimension movies is pruned, we pick a "Set of Near Neighbor users to U" (NNS) from the users, V, who have rated M (= supportM). "Near" is defined based on correlation with U. One can think of this step as the voter pruning step. Note: most correlation calculations involve the other variable also, i.e., the result of a user pruning depends on the pruned movie set and vice versa. Thus, theoretically, the movie/user pruning steps could be alternated ad infinitum! Our current approach is to allow an initial global dimension prune, then the voter prune, then a final dimension prune. You will see these 3 prune steps in the .config files.
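As an illustration of "near is defined based on correlation with U", here is a minimal sketch of a Pearson correlation computed only over the movies that both users have rated. The sparse map representation and the name pearson_on_cosupport are assumptions made for this example; the actual program works on PTrees and its formula may differ.

```cpp
#include <cmath>
#include <map>

// Sparse ratings of one user: movieID -> rating (1..5).
// Movies the user did not rate are simply absent.
using Ratings = std::map<int, int>;

// Pearson correlation of users u and v, restricted to the movies both
// have rated. Returns 0.0 when fewer than two common movies exist or
// when either user's ratings are constant on the common movies.
double pearson_on_cosupport(const Ratings& u, const Ratings& v) {
    double su = 0, sv = 0, suu = 0, svv = 0, suv = 0;
    int n = 0;
    for (const auto& entry : u) {
        auto it = v.find(entry.first);
        if (it == v.end()) continue;  // v did not rate this movie
        const double ru = entry.second;
        const double rv = it->second;
        su += ru;
        sv += rv;
        suu += ru * ru;
        svv += rv * rv;
        suv += ru * rv;
        ++n;
    }
    if (n < 2) return 0.0;
    const double cov = suv - su * sv / n;
    const double var_u = suu - su * su / n;
    const double var_v = svv - sv * sv / n;
    if (var_u <= 0.0 || var_v <= 0.0) return 0.0;
    return cov / std::sqrt(var_u * var_v);
}
```

Restricting the sums to the common movies is what makes the result depend on the current movie pruning, which is the interdependence noted above.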

We then let voters vote, but they don't necessarily cast the straightforward rating(M,V) vote.

The best way to think about the 3 pruning steps (and there could be more!) is this: we prune down the dimensions so that vector space methods are tractable, ameliorating the curse of dimensionality. The first step, which may be turned off, is a global dimension prune (not based on individual voters). The second is the voter prune, based on the currently pruned dimensions. The third is a final dimension prune (different for each voter), which gives the final vector space over which the vote by that voter is calculated. Then we let those VOTERS vote as to the best rating prediction to be made.

There are many ways to prune, vote, tally, and decide on the final prediction. These choices make up the .config file.

MOVIE VOTERs (approach 2) is identical, with the roles of Movies (voters) and Users (dimensions) reversed.
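One possible way to tally votes can be sketched as follows. This is an assumption for illustration only, since the slides leave the voting rule to the .config file: each voter contributes the offset of its rating of M from its own mean, weighted by its correlation with U. The names Vote and tally_votes are hypothetical.

```cpp
#include <vector>

// One voter's contribution (all fields are illustrative).
struct Vote {
    double rating;      // the voter's rating of M
    double voter_mean;  // the voter's mean rating over the pruned dimensions
    double weight;      // e.g., correlation with U, assumed non-negative
};

// Weighted mean-offset tally: the weighted average of (rating - voter_mean)
// is added to U's own mean, and the result is clamped to the 1..5 scale.
// With no usable votes the prediction falls back to U's mean.
double tally_votes(double user_mean, const std::vector<Vote>& votes) {
    double num = 0.0, den = 0.0;
    for (const Vote& v : votes) {
        num += v.weight * (v.rating - v.voter_mean);
        den += v.weight;
    }
    double prediction = den > 0.0 ? user_mean + num / den : user_mean;
    if (prediction < 1.0) prediction = 1.0;
    if (prediction > 5.0) prediction = 5.0;
    return prediction;
}
```

The same function serves movie voting if the voters are movies and the means are taken over users.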

Page 3: The $1,000,000 Netflix Contest

The Netflix Contest (Using SLURM to generate a clustering)

SLURM has been set up to run on the Penryn Cluster2 (32 8-processor machines, 1 terabyte of main memory) so that one can create a .config file (the name must end in .config) which specifies all the parameters for the program. Issuing:

./mpp-submit -S -i Data/probe-full.txt -c pf.0001/u.00.00/u.00.00.config -t .0001 -d ./pf.0001

The program pulls its parameters from the .config file; -t .0001 means SquareError threshold = .0001, and -d ./pf.0001 means the results go to the ./pf.0001 directory. The program takes as input the file Data/probe-full.txt (which is not quite the full probe, but close). The general form of the command is:

mpp-submit -S -i InputFile.txt -c ConfigFile.config -t SqErrThrhld -d Dir

Takes as input:
InputFile.txt (in MovieID-with-interleaved-UserIDs format, or .txt format; see next slide)
ConfigFile.config (shows which program to run, in .config format; see next slide)
SqErrThrhld (if PredictionSqErr ≤ SqErrThrhld, put the pair in Dir/lo-InputFile.txt, else put it in Dir/hi-InputFile.txt)
Dir (an existing directory for the output)

Puts as output (in Dir):
lo-InputFileName.txt
hi-InputFileName.txt
InputFileName.config
InputFileName.rmse
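The lo/hi split and the .rmse summary can be sketched as below. This is a minimal illustration under the assumption that each Probe pair carries its known rating; the struct and function names are hypothetical and do not come from mpp-submit.

```cpp
#include <cmath>
#include <vector>

// One prediction for a Probe pair (names are illustrative).
struct Prediction {
    int movie_id;
    int user_id;
    double predicted;
    double actual;  // the known rating for this Probe pair
};

struct Split {
    std::vector<Prediction> lo;  // square error <= threshold
    std::vector<Prediction> hi;  // square error >  threshold
};

// Partition predictions by thresholded square error, as the -t option does.
Split split_by_square_error(const std::vector<Prediction>& preds,
                            double threshold) {
    Split out;
    for (const Prediction& p : preds) {
        const double err = p.predicted - p.actual;
        if (err * err <= threshold)
            out.lo.push_back(p);
        else
            out.hi.push_back(p);
    }
    return out;
}

// Root mean square error over all predictions (0.0 for an empty input).
double rmse(const std::vector<Prediction>& preds) {
    if (preds.empty()) return 0.0;
    double sum = 0.0;
    for (const Prediction& p : preds) {
        const double err = p.predicted - p.actual;
        sum += err * err;
    }
    return std::sqrt(sum / preds.size());
}
```

With a threshold as small as .0001 and integer answers, the lo file holds essentially the pairs that the configured program predicts almost exactly.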

Page 4: The $1,000,000 Netflix Contest

The Netflix Contest (Using SLURM to generate a clustering)

./mpp-submit -S -i Data/probe-full.txt -c pf.0001/u.00.00/u.00.00.config -t .0001 -d ./pf.0001

InputFile: Data/probe-full.txt
ConfigFile: pf.0001/u.00.00/u.00.00.config

1: 30878 2647871 1283744 2488120 317050 1904905 1989766 14756 1027056 1149588 1394012 1406595 2529547 1682104 2625019 2603381 1774623 470861 712610 1772839 1059319 2380848 548064
2: 1959936 748922 1131325 1312846 2314531 1636093 584750 2418486 715897 1172326 etc.

where 1: and 2: are movieIDs and the others are userIDs. Note that this is an interleaved format of a 2-column DB file, probe-full(movieID, userID).
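Expanding the interleaved format back into (movieID, userID) pairs takes only a few lines. The sketch below is a standalone illustration (the name parse_interleaved is hypothetical); it follows the same rule as the main loop in mpp-mpred.C, where a token ending in ':' starts a new movie.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Expand "movieID: userID userID ... movieID: ..." into
// (movieID, userID) pairs. A token ending in ':' names the current movie;
// every other token is a user who belongs to that movie.
std::vector<std::pair<int, int>> parse_interleaved(std::istream& in) {
    std::vector<std::pair<int, int>> pairs;
    std::string token;
    int movie = 0;
    while (in >> token) {
        if (token.back() == ':') {
            token.pop_back();
            movie = std::atoi(token.c_str());
        } else {
            pairs.emplace_back(movie, std::atoi(token.c_str()));
        }
    }
    return pairs;
}
```

For example, feeding it a std::istringstream holding "1: 30878 2647871 2: 1959936" yields the pairs (1,30878), (1,2647871) and (2,1959936).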

The program sets parameters as specified in the .config:

user_voting = enabled
movie_voting = disabled
user_vote_weight = 1

# processed only if user voting enabled.
[user_voting]
Prune_Movies_in_SupU = disabled
Prune_Users_in_SupM = enabled
Prune_Movies_in_CoSupUV = enabled

[Prune_Movies_in_SupU]
method = MoviePrune
leftside = 0
width = 30000
mstrt = 0
mstrt_mult = 0
ustrt = 0
ustrt_mult = 0
TSa = -100
TSb = -100
Tdvp = -1
Tdvs = -1
Tvdp = -1
Tvds = -1
TD = -1
TP = -1
PPm = .1
TV = -1
TSD = -1
Ch = 1
Ct = 1

[Prune_Movies_in_CoSupUV]
method = MovieCommonCoSupportPrune
leftside = 0
width = 2000
mstrt = 0
mstrt_mult = 0
ustrt = 0
ustrt_mult = 0
TSa = -100
TSb = -100
Tdvp = -1
Tdvs = -1
Tvdp = -1
Tvds = -1
TD = -1
TP = -1
PPm = .1
TV = -1
TSD = -1
Ch = 1
Ct = 1

[Prune_Users_in_SupM]
method = UserCommonCoSupportPrune
leftside = 0
width = 30000
mstrt = 0
mstrt_mult = 0
ustrt = 0
ustrt_mult = 0
TSa = -100
TSb = -100
Tdvp = -1
Tdvs = -1
Tvdp = -1
Tvds = -1
TD = -1
TP = -1
PPm = .1
TV = -1
TSD = -1
Ch = 1
Ct = 1

(The movie voting params form a part identical to the user voting part above.)

Only the method, leftside, width, Ch=Choice, Ct=Count parameters are used at this time.
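Reading such a file can be sketched as below. This is only an illustration: it assumes one "key = value" entry per line, "[section]" headers, and "#" comments, and the names Config and parse_config are hypothetical. The program's real reader (mppConfig) is not shown in these slides and may behave differently.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

using Section = std::map<std::string, std::string>;
using Config = std::map<std::string, Section>;  // section name -> entries

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Parse "[section]" headers and "key = value" lines; text after '#' is a
// comment. Keys that appear before any header are stored under "".
Config parse_config(std::istream& in) {
    Config cfg;
    std::string line, section;
    while (std::getline(in, line)) {
        line = trim(line.substr(0, line.find('#')));
        if (line.empty()) continue;
        if (line.front() == '[' && line.back() == ']') {
            section = trim(line.substr(1, line.size() - 2));
            continue;
        }
        const std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;  // not a key = value line
        cfg[section][trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return cfg;
}
```

Values are kept as strings here; a caller would convert width or Ct to numbers as needed.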

Using this program, the many "lo-u.xx.xx" files and, if movie voting is also enabled, the "lo-m.yy.yy" files constitute what we have called a clustering (though the clusters are not mutually exclusive). Once we have {lo-z.xx.yy | z = u or m}, we can make a submission as follows: for each qualifying pair (m,u), use correlations to pick the program that makes that prediction.

Page 5: The $1,000,000 Netflix Contest

The Netflix Contest (Using this scheme to predict Qualifying pair ratings)

The above prediction scheme requires the existence of Square Errors (SqErr); e.g., the cluster files lo-u.vv.nn.txt and lo-m.nn.vv.txt are composed of all input pairs such that SqErr ≤ .0001.

To predict rating(M,U) for pairs from Qualifying, we won’t have answers, so we won’t have SqErrs of our predictions relative to those answers.

So how can we form good clusters then?

Once that's decided, what matchup algorithm should we use to match a cluster (program) to a Qualifying pair to be predicted?

After the clusters are created, we can try the matchup algorithms that worked best for Probe predictions, but we may want to develop new ones because the performance of those matchup algorithms may depend on the way the clusters were created.

We could use the same 288 configs to generate a new config-subset-collection of Qualifying pairs using, e.g., some kind of prediction variation instead of thresholded prediction SqErr?

lo-u.vv.nn.txt could be constructed to consist of Qualifying pairs as follows (a variation-based method):
1. Set all answers in Qualifying to 1. Use ./mpp-submit to create clusters as above (threshold = .0001) in a directory, q1.
2. Set all answers in Qualifying to 2. Use ./mpp-submit to create clusters as above (threshold = .0001) in a directory, q2.
3. Continue in the same way for answers 3, 4 and 5.
This will create a clustering of 288*5 = 1440 cluster sets (but, of course, only 288 different program configs).

One could match up a Qualifying pair using count-based correlations, Pearson correlations, 1-perpendicular correlations, or something else?
One could match up (M,U) with the cluster in which the sum of the M and U counts (or counts relative to cluster size) is max?
Other?
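The count-based matchup suggested above can be sketched as follows, assuming each cluster is held in memory as a list of (movieID, userID) pairs. The names are hypothetical, and dividing the count by the cluster size would give the relative variant.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Cluster = std::vector<std::pair<int, int>>;  // (movieID, userID) pairs

// Index of the cluster in which movie m and user u occur most often
// (sum of the two occurrence counts), or -1 if neither occurs anywhere.
// Ties are resolved in favor of the earliest cluster.
int best_cluster_by_count(const std::vector<Cluster>& clusters, int m, int u) {
    int best = -1;
    std::size_t best_count = 0;
    for (std::size_t i = 0; i < clusters.size(); ++i) {
        std::size_t count = 0;
        for (const auto& p : clusters[i]) {
            if (p.first == m) ++count;
            if (p.second == u) ++count;
        }
        if (count > best_count) {
            best_count = count;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

The returned index identifies the config (program) that would then be run to predict rating(M,U).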

Page 6: The $1,000,000 Netflix Contest

[Figure: layouts of the TRAINING data]

Training (Uid,Mid,R,D) ordered by Uid: columns uID, mID, rating, day_number; rows run from (u1, m1, r_u,m, d_u,m) to (u480189, m17770).

Training (Mid,Uid,R,D) ordered by Mid: columns mID, uID, rating, day_number; rows run from (m1, u1, r_m,u, d_m,u) to (m17770, u480189 or U2649429, r_17770,480189, d_17770,480189).

The Netflix Files {Mi} i=1..17770, given by Netflix as Mi(uID, Rating, Date): for each MovieID, Mi, this is a file of all users who rated it, the rating, and the rating date. Rows run from (u_i1, r_mk,u, d_mk,u) to u_i,ni.

TRAINING as M-U interaction cube (Rolodex Model, m\u): rows m1 ... mh ... m17770, columns u1 ... uk ... u480189, with entry r_mh,uk in cell (mh, uk).

bit-sliced TRAINING: M-U interaction cube (Rolodex Model, m\u): bit slices b=0 ... b=13 over ratings and day_numbers, e.g., P_mh,2 and P_u480189,0.

Labels on the figure: 17,770; 100,480,507; avg: 5655 u/m; avg: 209 m/u; 47B.

TRAINING in MySQL with key (mID, uID) and with key (uID, mID): 11-bit day numbers starting at 1=1/1/99 and ending at 2922=12/31/06.
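The idea of bit-slicing the ratings can be sketched as below. Since ratings lie in 0..5, three bit slices suffice (the source code also notes that each rating is encoded using three PTrees). This sketch shows only the uncompressed vertical bit vectors; the PTree compression itself is not modeled, and the names are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Three vertical bit slices of a rating column:
// slices[b][i] is bit b of ratings[i].
using BitSlices = std::array<std::vector<bool>, 3>;

// Split a column of ratings (0..5, 0 = not rated) into its bit slices.
BitSlices slice_ratings(const std::vector<int>& ratings) {
    BitSlices slices;
    for (auto& s : slices) s.resize(ratings.size());
    for (std::size_t i = 0; i < ratings.size(); ++i)
        for (int b = 0; b < 3; ++b)
            slices[b][i] = ((ratings[i] >> b) & 1) != 0;
    return slices;
}

// Reassemble the rating at position i from the three slices.
int rating_at(const BitSlices& slices, std::size_t i) {
    int r = 0;
    for (int b = 0; b < 3; ++b)
        if (slices[b][i]) r |= 1 << b;
    return r;
}
```

A position holds an actual rating exactly when at least one of its three bits is set, so the "rated" mask is the bitwise OR of the slices.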

Page 7: The $1,000,000 Netflix Contest

The Program: Code Structure - the main modules

mpp-mpred.C

mpp-user.C

user-vote.C movie-vote.C

prune.C

mpp-mpred.C reads a Netflix PROBE file Mi(Uid) and passes Mi and ProbeSupport(Mi) to mpp-user.C to make predictions for each pair (Mi,U), for each U ∈ ProbeSupport(Mi). It can also call separate instances of mpp-user.C for many Us, to be processed in parallel (governed by the number of "slots" specified in the 1st code line).

mpp-user.C loops through ProbeSupport(M), the ULOOP, reading in the designated (matched-up) config file, then writing out a (Mi,U) prediction for each U.

If the user-vote approach is used, mpp-user.C calls user-vote.C, passing it (M, Support(M), U, Support(U)). If the movie-vote approach is used, mpp-user.C calls movie-vote.C, passing it (M, Support(M), U, Support(U)).

user-vote.C does the specified pruning by calling prune.C, looping through the pruned set of user voters, V, calculating a vote for each, combining those votes and returning a prediction_vote(M,U)

movie-vote.C does similarly.

Page 8: The $1,000,000 Netflix Contest

What kind of pruning can be specified?

There are up to 3 types of pruning used (for pruning down support(M), the set of all users that rated M, or pruning down support(U), the set of all movies rated by U):

1. correlation or similarity threshold based pruning
2. count based pruning
3. ID window based pruning

Under correlation or similarity threshold based pruning, and using support(M) = supM for example (pruning support(U) is similar), we allow any function f: supM × supM → [0, HighValue] to be called a user correlation provided only that f(u,u) = HighValue for every u in supM. Examples include Pearson_Correlation, Gaussian_of_Distance, 1_perp_Correlation (see appendix of these notes), relative_exact_rating_match_count (which Tingda is using), dimension_of_common_cosupport, and functions based on Standard Deviations.

Under count based pruning, we usually order by one of the correlations above first (into a multimap) then prune down to a specified count of the most highly correlated.
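Count based pruning with a multimap can be sketched as follows. The correlation values are assumed to have been computed already by one of the functions above; the function name and signature are illustrative, not those of prune.C.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

// Keep the `count` candidates with the highest correlation.
// correlations[i] is the correlation of candidate ids[i].
// The multimap orders candidates by descending correlation and
// tolerates equal correlation values.
std::vector<int> count_prune(const std::vector<int>& ids,
                             const std::vector<double>& correlations,
                             std::size_t count) {
    std::multimap<double, int, std::greater<double>> ordered;
    for (std::size_t i = 0; i < ids.size() && i < correlations.size(); ++i)
        ordered.emplace(correlations[i], ids[i]);

    std::vector<int> kept;
    for (const auto& entry : ordered) {
        if (kept.size() == count) break;
        kept.push_back(entry.second);
    }
    return kept;
}
```

In terms of the config file, the ordering would be chosen by Ch and the count by Ct.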

Under ID window based pruning we prune down to a window of userIDs within supM (or movieIDs within supU) by specifying a leftside (number added to U, so leftside is relative to U as a userID) and a width.
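An ID window prune is a simple range filter. In the sketch below the window is taken to be the half-open interval [U + leftside, U + leftside + width); whether the real program treats the right end as inclusive is not stated in the slides, so that choice is an assumption, as is the function name.

```cpp
#include <vector>

// Keep only the IDs of `support` that fall in the window
// [u + leftside, u + leftside + width). The half-open upper bound
// is an assumption made for this sketch.
std::vector<long> id_window_prune(const std::vector<long>& support,
                                  long u, long leftside, long width) {
    const long lo = u + leftside;
    const long hi = lo + width;
    std::vector<long> kept;
    for (long id : support)
        if (id >= lo && id < hi) kept.push_back(id);
    return kept;
}
```

With leftside = 0 and width = 30000, as in the sample config sections, the window starts at U's own ID and extends 30000 IDs to the right.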

Again, all parameters are specified in a configuration file and the values specified there are consumed at runtime using, e.g., the call:

mpp -i Input_.txt_file -c config -n 16

where Input_.txt_file is the input Probe subset file and 16 is the number of parallel threads that mpp-mpred.C will generate (here, 16 movies are processed in parallel, each sent to a separate instantiation of mpp-user.C).

A sample config file is given later.

Page 9: The $1,000,000 Netflix Contest

How does one specify prunings?

Again, in a file (this one is named config) there is a section for specifying the parameters for user-voting and a separate section for specifying parameters for movie-voting. E.g., for movie voting, at the bottom, there are 3 external prunings possible (0 or more can be chosen):
1. an initial pruning of dimensions to be used (since dimensions are users, it prunes supM);
2. a pruning of movie voters, N (in supU);
3. a final pruning of dimensions (CoSupport(M,N)) for the specific movie voter, N.
E.g., parameters are specified for this final prune as follows:

[movie_voting Prune_Users_in_CoSupMN]
method = UserCommonCoSupportPrune   # specifies type of prune (there are 3 types, see below)
leftside = 0       # specify leftside (from Uid) of an ID interval prune of supM
width = 8000       # specify the width of an ID interval prune of supM
mstrt = 0          # specify starting movie (intercept) for N loop
mstrt_mult = 0.0   # specify starting movie (slope) for N loop
ustrt = 0          # specify starting movie (intercept) for V loop
ustrt_mult = 0.0   # specify starting movie (slope) for V loop
TSa = -100         # specify PearsonCorr threshold (a=Amal, meaning: use Amal's table lookup)
TSb = -100         # specify PearsonCorr threshold (b=bill, meaning: use bill's formula)
Tdvp = -1          # threshold "diff of vectors" population-based std_dev prune
Tdvs = -1          # threshold "diff of vectors" sample-based std_dev prune
Tvdp = -1          # threshold "vector of diffs" population-based std_dev prune
Tvds = -1          # threshold "vector of diffs" sample-based std_dev prune
TD = -1            # threshold (Gaussian of) Euclidean distance based prune
TP = -1            # threshold for (Gaussian of) 1perpendicular distance prune
PPm = .1           # exponent for (Gaussian of) 1perpendicular distance prune
TV = -1            # threshold (Gaussian of) a variation based prune
TSD = -1           # threshold std_dev based prune
Ch = 1             # picks ordering for count-based prune below: 1=Amal_Pearson, 2=Bill_Pearson, etc.
Ct = 2             # threshold for count based prune

The 3 types of prune are: UserPrune, with a full range of possibilities; UserFastPrune, with just PearsonCorrelation pruning; and CommonCoSupportPrune, which orders users, V, according to the size of their CommonCoSupport with U only (note that this is a correlation of sorts too).

Note on TSb: if there has been prior pruning, bill's formula will have a different value than Amal's.

Note: all thresholds are for similarities, not distances; i.e., when we start with a distance we follow it with the Gaussian to make it a similarity or correlation.
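Turning a distance into a similarity with a Gaussian can be sketched as below. The exact form used by the program is not given in the slides; the expression exp(-d^2 / (2 sigma^2)) and the sigma parameter are assumptions, chosen so that a distance of 0 maps to the highest value, 1.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean distance between two equally long rating vectors
// (extra trailing entries of the longer vector are ignored).
double euclidean_distance(const std::vector<double>& a,
                          const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Gaussian of a distance: 1 at distance 0, decreasing toward 0 as the
// distance grows. The width parameter sigma is hypothetical.
double gaussian_similarity(double distance, double sigma = 1.0) {
    return std::exp(-(distance * distance) / (2.0 * sigma * sigma));
}

// A candidate survives a threshold prune when its similarity is at
// least the threshold.
bool passes_threshold(double similarity, double threshold) {
    return similarity >= threshold;
}
```

Because the result is a similarity, larger is better, which keeps every threshold in the config pointing in the same direction.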

Page 10: The $1,000,000 Netflix Contest

mpp-mpred.C, part 1

/** \file
 *
 * This contains the main entry point and contains the code for driving
 * the multi-process shared memory implementation of the vertical PTree
 * based predictor system.
 */

/* Standard includes. */
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <time.h>

/* Standard C++ includes. */
#include <fstream>
#include <iostream>
#include <vector>

/* Local C++ includes. */
#include "mppConfig.H"
#include "PredictionConfig.H"
#include "UserSet.H"
#include "MovieSet.H"

#include "mpp.h"

using namespace std;

/* Definition of structures static to this module. */
struct task_table {
    int pid;
    int movie;
    int predictions;
    time_t start;
};

/*
 * The following two global variables define the two sets of PTree's
 * which will be used to carry out the predictions.
 *
 * The UserSet of PTree's have user rating PTree's across the vertical
 * axis of the table. Each rating is encoded using three PTree's.
 *
 * The MovieSet has movie rating PTree's across the vertical axis of
 * the table. Each movie is encoded using three PTree's.
 */
UserSet Users;
MovieSet Movies;

int topMovK = 5, verK = 50;

bool use_pearson_movies = false;

/*
 * The minimum user correlation required to be eligible to participate
 * in voting.
 */
float Minimum_User_Correlation = 0.5;

float corData[17771];

unsigned short int supData[17771];
string probe;

/* External functions. */
extern int Mpred_User_Predict(mppConfig &, unsigned long int, vector <int> &,
                              PTree &);

/**
 * Internal private function.
 *
 * This function prints the current status of the task table. It is
 * an encapsulation function for reducing the complexity of the
 * job_table function.
 *
 * In the case of either transaction a status table is printed out
 * which reflects the current progress of the simulation.
 *
 * \param max_slots  The maximum number of subordinate processes
 *                   which will be managed.
 *
 * \param table      A pointer to the task table which is to
 *                   be changed.
 *
 * \param changed    The slot number in the task table which is
 *                   being updated.

Page 11: The $1,000,000 Netflix Contest

mpp-mpred.C, part 2

 * \param reason     A character pointer to a description string
 *                   indicating why the table is being updated.
 */
extern void print_job_table(int max_slots,
                            struct task_table const * const table,
                            int const changed, char const * const reason)
{
    int entry;
    time_t now = time(NULL);

    fprintf(stdout, "Task status change: %s", ctime(&now));
    fputs("\tSlot\t PID\tMovie\tUsers\n", stdout);
    fputs("\t----\t-----\t-----\t-----\n", stdout);

    for (entry = 0; entry < max_slots; ++entry) {
        fprintf(stdout, "\t%-5d\t%5d\t%5d\t%5d", entry,
                table[entry].pid, table[entry].movie,
                table[entry].predictions);
        if ( entry == changed )
            fprintf(stdout, "\t<- %s\n", reason);
        else
            fputs("\n", stdout);
    }
    fputs("\n", stdout);
    return;
}

/**
 * Internal private function.
 *
 * This function maintains a table which correlates process ID's with
 * the movies they are processing, the total number of predictions
 * required per movie and the time required to process a movie.
 *
 * Depending on the value of the movie number argument this function
 * either stores the relationship or retrieves the movie associated
 * with the PID.
 *
 * In the case of either transaction a status table is printed out
 * which reflects the current progress of the simulation.
 *
 * \param max_slots     The maximum number of subordinate processes
 *                      which are under management.
 *
 * \param pid           The process ID number.
 * \param movie_number  A movie value of zero causes this function to locate
 *                      and return the PID of the subordinate slave process
 *                      which is processing the movie. A non-zero value
 *                      causes the PID to be stored in the relationship array.
 * \param predictions   This argument is only referenced when an update
 *                      is made to the task table. This argument is
 *                      the number of customer predictions to be made
 *                      for the movie being scheduled.
 * \return              No return values are defined.
 */
extern void job_table(int max_slots, int const pid, int const movie_number,
                      int const predictions)
{
    char msg[50];
    int lp, changed = 0;

    time_t now = time(NULL);

    static int movie_count = 0, prediction_count = 0;

    static bool first = true;

    static struct task_table *table;

    /* Initialize the process table on the first call. */
    if ( first ) {
        size_t amt = max_slots * sizeof(struct task_table);
        table = (struct task_table *) malloc(amt);
        if ( table == NULL ) {
            fputs("Cannot allocate job table.\n", stderr);
            exit(1);
        }

        for (lp = 0; lp < max_slots; ++lp) {
            table[lp].pid = 0;
            table[lp].movie = 0;
            table[lp].predictions = 0;
            table[lp].start = 0;

Page 12: The $1,000,000 Netflix Contest

mpp-mpred.C, part 3

        }

Page 13: The $1,000,000 Netflix Contest

mpp-mpred.C, part 4

        first = false;
    }

    /* Add a task to the table. */
    if ( movie_number != 0 ) {
        for (lp = 0; lp < max_slots; ++lp) {
            if ( table[lp].pid == 0 ) {
                changed = lp;
                table[lp].pid = pid;
                table[lp].movie = movie_number;
                table[lp].predictions = predictions;
                table[lp].start = now;

                print_job_table(max_slots, table, changed, "Started");
                fflush(stdout);
                return;
            }
        }
    }

    /* Remove a task from the table. */
    for (lp = 0; lp < max_slots; ++lp) {
        if ( table[lp].pid == pid ) {
            time_t run_time = time(NULL) - table[lp].start;
            float per_user = run_time;
            prediction_count += table[lp].predictions;

            snprintf(msg, sizeof(msg), "Completed: %lu "
                     "[%.2f/user] secs.", (unsigned long) run_time,
                     per_user/table[lp].predictions);
            print_job_table(max_slots, table, lp, msg);

            table[lp].pid = 0;
            table[lp].movie = 0;
            table[lp].predictions = 0;
            table[lp].start = 0;

            fprintf(stdout, "\tMovies: %5d\tPredictions: %d\n\n",
                    ++movie_count, prediction_count);
            fflush(stdout);
            return;
        }
    }
}

/**
 * Main program starts here.
 */
int main(int argc, char **argv)
{
    /* The following variable controls whether or not movie predictions
     * are to be run in parallel, ie. each in its own process. */
    bool have_input = false, single_threaded = true;

    char snbufr[10];

    int movie_count = 0;
    int max_process_slots, process_count = 0;

    pid_t pid;
    time_t run_start, t1, t2;

    string data_root = PTREEDATA "/";

    string corr_root = data_root + "mv_corr/co_mv_";
    string supp_root = data_root + "mv_supp/sp_mv_";

    string ptree_set_id = data_root + "nf_us_mv_pt";
    string ptree_set_idT = data_root + "nf_mv_us_pt";

    ifstream inFile1;
    ifstream inFile2;

    mppConfig config;

    /* Option parsing. */
    int gopt;
    while ( (gopt = getopt(argc, argv, "C:c:i:n:")) != EOF ) {
        switch ( gopt ) {
        case 'c':
            if ( !config.read_config(optarg) ) {
                fprintf(stderr, "%s: Cannot read "
                        "standard configuration - "
                        "%s\n", argv[0], optarg);
                exit(1);
            }

Page 14: The $1,000,000 Netflix Contest

mpp-mpred.C, part 5

            break;
        case 'C':
            if ( !config.read_cluster_config(optarg) ) {
                fprintf(stderr, "%s: Cannot read "
                        "cluster configuration - "
                        "%s\n", argv[0], optarg);
                exit(1);
            }
            break;
        case 'i':
            have_input = true;
            probe.assign(optarg);
            break;
        case 'n':
            single_threaded = false;
            max_process_slots = atoi(optarg);
            break;
        }
    }

    if ( !have_input ) {
        fprintf(stderr, "%s: No input file specified.\n", argv[0]);
        return 1;
    }

    if ( !config.is_standard_config() && !config.is_cluster_config() ) {
        fprintf(stderr, "%s: No configuration specified.\n", argv[0]);
        return 1;
    }

    fprintf(stderr, "%s: Vertical Rating Predictor - %s\n\n", argv[0], VERSION);
    fputs("Data files:\n", stderr);
    fprintf(stderr, "\tid:\t%s\n", ptree_set_id.c_str());
    fprintf(stderr, "\tidT:\t%s\n", ptree_set_idT.c_str());
    fprintf(stderr, "\tsupp:\t%s*\n", supp_root.c_str());
    fprintf(stderr, "\tcorr:\t%s*\n\n", corr_root.c_str());
    fprintf(stderr, "\tInput:\t%s\n\n", probe.c_str());
    if ( single_threaded )
        fputs("Mode: single-threaded\n", stderr);
    else
        fprintf(stderr, "Mode: %d way multi-processor\n", max_process_slots);
    if ( config.is_standard_config() ) {
        PredictionConfig *pcfg = config.get_standard_config();
        fputs("\nPrediction configuration:\n", stderr);
        pcfg->print(stderr);
    }

    /** Load the rating data as two separate sets of PTree's. */
    t1 = time(NULL);

    fputs("Data load started.\n", stderr);
    fputs("\tUser ptrees - ", stderr);
    if ( !Users.load_binary() ) {
        fputs("\n\nFailed load.\n", stderr);
        return 1;
    }
    fputs("identities - ", stderr);
    if ( !Users.load_identities() ) {
        fputs("\n\nFailed load.\n", stderr);
        return 1;
    }
    fputs("completed.\n", stderr);

    fputs("\tMovie ptrees - ", stderr);
    if ( !Movies.load_binary() ) {
        fputs("\n\nFailed load.\n", stderr);
        return 1;
    }
    fputs("completed.\n", stderr);

    t2 = time(NULL);
    fprintf(stderr, "Data load completed, time = %ld\n\n", (long) (t2 - t1));

    ifstream inFile;
    inFile.open(probe.c_str());

    char str[100];
    int last_movie_id = 0, new_movie_id = 0;
    bool last_movie = true;

    inFile >> str;
    string str1(str);
    str1.erase(str1.size() - 1);
    new_movie_id = atoi(str1.c_str());

    /* Start of loop over movies begins here. */
    run_start = time(NULL);

    for (int movie_cnt = 0; !inFile.eof(); movie_cnt++) {
        vector <int> probeUs;

Page 15: The $1,000,000 Netflix Contest

mpp-mpred.C, part 6

        ++movie_count;
        last_movie_id = new_movie_id;
        last_movie = true;

        while ( last_movie && (inFile >> str) ) {
            string str1(str);
            if (str1.at(str1.size() - 1) == ':') {
                str1.erase(str1.size() - 1);
                new_movie_id = atoi(str1.c_str());
                last_movie = false;
            }
            else
                probeUs.push_back(atoi(str1.c_str()));
        }

        /* M is the movie to be predicted. */
        t1 = time(NULL);
        unsigned long int M = last_movie_id - 1;

        /* read the pearson correlations for movies
         * NOTE using pearson not Perp
         * Try to find best co-related movie set for
         * pmv */
        snprintf(snbufr, sizeof(snbufr), "%d", last_movie_id);
        string sn(snbufr);

        string outCorr1 = corr_root + sn + ".bin";
        inFile1.open( outCorr1.c_str() );

        string outSupp1 = supp_root + sn + ".bin";
        inFile2.open( outSupp1.c_str() );

        inFile1.read(reinterpret_cast<char*>(&corData),
                     17771*sizeof(float));
        inFile2.read(reinterpret_cast<char*>(&supData),
                     17771*sizeof(short int));
        inFile1.close();
        inFile2.close();

        /* Get the list of users who have rated this movie. */
        PTree user_list = Movies.get_users(M);

        /* Check to see if predictions of movies are
         * to be single-threaded. If so run the
         * movie prediction synchronously and then
         * skip to the next movie. */
        if ( single_threaded ) {
            time_t now = time(NULL);
            float start = now;
            fprintf(stderr, "Starting movie: %lu, "
                    "Users: %lu, ", M, (unsigned long) probeUs.size());

            Mpred_User_Predict(config, M, probeUs, user_list);

            now = time(NULL);
            fprintf(stderr, "Completed: %2.0f "
                    "[%.2f/user] secs.\n\n", now - start,
                    (now - start)/probeUs.size());
            continue;
        }

        /* Start prediction for movie pmv for given
         * users in probeUser set. Fork a new process and
         * generate customer predictions in this new fork. */
        if ( process_count < max_process_slots ) {
            pid = fork();
            if ( pid == -1 ) {
                perror("FPP fork failed.");
                exit(1);
            }

            /* Child - process movie and exit. */
            if ( pid == 0 ) {
                Mpred_User_Predict(config, M, probeUs, user_list);
                _exit(0);
            }

            /* Parent - update task table. */
            ++process_count;

            job_table(max_process_slots, pid, M, probeUs.size());
        }

        /* Wait for any child processes to complete. */
        if ( process_count == max_process_slots ) {
            int status;
            pid = wait(&status);
            if ( pid == -1 ) {
                perror("FPP wait failed.");
                exit(1);
            }

            --process_count;
            job_table(max_process_slots, pid, 0, 0);

            if ( WIFEXITED(status) == 0 ) {
                fprintf(stderr, "\tError in movie, "
                        "status = %d\n",
                        WEXITSTATUS(status));
            }
        }
    }

    /* Capture all remaining slave processes. */
    do {
        int status;
        pid = wait(&status);
        if ( pid == -1 ) {
            fputs("No processes left.\n", stderr);
            process_count = 0;
            continue;
        }
        --process_count;
        job_table(max_process_slots, pid, 0, 0);

        if ( WIFEXITED(status) == 0 ) {
            fprintf(stderr, "\tError in movie, "
                    "status = %d\n",
                    WEXITSTATUS(status));
        }
    } while ( process_count > 0 );

    inFile.close();
    fputs("\nPredictions completed.\n", stderr);
    fprintf(stderr, "\tMovies: %d\n", movie_count);
    fprintf(stderr, "\tTime: %ld\n", (long) (time(NULL) - run_start));
    return 0;
}

Page 16: The $1,000,000 Netflix Contest

/** \file * This file contains the driver code which * implements predictions of recommendations. */

/* Program compilation defines folloow. * * These defines enable and control generation of movie specific logfiles. * The MOVIE_LOGGING define needs to be enabled to turn on generation of * logfiles. Other defines increase the amount of output generated. */#if 0#define MOVIE_LOGGING#endif#if 0#define MEMORY_LOGGING#endif#if 0#define VOTE_LOGGING#endif

// Include files.#include <stdio.h>#include <time.h>// Standard C++ includes.#include <fstream>#include <iostream>#include <vector>#include <map>#include <utility>// Local C++ include files.#include <PTreeSet.H>#include "mppConfig.H"#include "UserSet.H"#include "MovieSet.H"/* Standard C include files. */#include "mpp.h"using namespace std;// External variables.extern int topMovK, verK;extern bool use_pearson_movies;extern float Minimum_User_Correlation;extern float corData[17771];extern unsigned short int supData[17771];extern string probe;

// LOGGING: creates and opens the logfile if logging is enabled, else returns NULL.
#if defined(MOVIE_LOGGING)
static inline FILE * open_logfile(string movie_number)
{
    auto string logname("./Output/" + probe.substr(probe.find_last_of('/') + 1) +
                        "_" + movie_number + ".log");
    return(fopen(logname.c_str(), "w+"));
}
#else
static inline FILE * open_logfile(string movie_number)
{
    return NULL;
}
#endif

// ENABLING causes nearest nbr user voting to print for each prediction.
#if defined(VOTE_LOGGING)
static inline void print_votes(FILE *logfile, int user, double vote, double weight,
                               double vRt, double VBar, double Ub, double voter_corr)
{
    if ( logfile == NULL ) return;
    fprintf(logfile, "\t\tVote: %.2f\tWeight: %.2f\tUser: %d\n", vote, weight, user);
    fprintf(logfile, "\t\t\tvRt: %.2f\tVbar: %.2f\tUb: %.2f\n", vRt, VBar, Ub);
    fprintf(logfile, "\t\t\tCor: %.2f\n\n", voter_corr);
    return;
}
#else
static inline void print_votes(FILE *logfile, int user, double vote, double weight,
                               double vRt, double VBar, double Ub, double voter_corr)
{
    return;
}
#endif

// Enabling prints amount of memory consumed against given starting pt.
#if defined(MEMORY_LOGGING)
static inline void log_memory(FILE *logfile, const char *fmt, void *start)
{
    fprintf(logfile, fmt, (char *) sbrk(0) - (char *) start);
    return;
}
#else
static inline void log_memory(FILE *logfile __attribute__ ((unused)),
                              const char *fmt __attribute__ ((unused)),
                              void *start __attribute__ ((unused)))
{
    return;
}
#endif

extern int Mpred_User_Predict (mppConfig &config, unsigned long int M,
                               vector <int> & user_list, PTree & M_support)
{
    auto void *movie_memory_start;
    auto char snbufr[10];
    auto time_t start_time = time(NULL);
    auto unsigned long int U;
    auto FILE *predictions;
    auto FILE *logfile;
    auto PredictionConfig *pcfg = NULL;

mpp-user.C1


    // OPEN log and prediction files.
    snprintf(snbufr, sizeof(snbufr), "%lu", Movies.get_identity(M));
    string sn(snbufr);
    string outPredName("./Output/" + probe.substr(probe.find_last_of('/') + 1)
                       + "_" + sn + ".predict");
    logfile = open_logfile(sn);

    if ( (predictions = fopen(outPredName.c_str(), "w+")) == NULL ) {
        fputs("Cannot open prediction file.\n", stderr);
        if ( logfile != NULL ) fclose(logfile);   /* do not leak the logfile */
        return 0;
    }

    /*
     * Write the number of the movie to the prediction file and a
     * descriptor to the output logfile.
     */
    fprintf(predictions, "%lu:\n", Movies.get_identity(M));
    if ( logfile != NULL ) fflush(logfile);

    /* The casts keep the arguments in step with the format specifiers. */
    if ( logfile != NULL )
        fprintf(logfile, "\nBeginning movie: %5lu\tUsers: %lu\tPID: %d\n",
                (unsigned long) Movies.get_identity(M),
                (unsigned long) user_list.size(), (int) getpid());

    if ( logfile != NULL ) movie_memory_start = sbrk(0);

    /* Select eligible clusters for this movie. */
    if ( config.is_cluster_config() ) config.select_clusters(Movies, M);

    /* Loop over users starts here. */
    for (unsigned int user= 0; user < user_list.size(); ++user) {
        auto double vote = DEFAULT_VOTE, VOTE = DEFAULT_VOTE,
                    vote_wt = 0.0, VOTE_wt = 0.0;

        U = Users.get_index(user_list[user]);
        auto PTree supportM(M_support), supportU = Users.get_movies(U);

        supportM.clearbit(U);
        supportU.clearbit(M);

        if ( supportM.get_count() < 1) {
            fprintf(predictions, "%.2f\n", vote);
            fflush(predictions);
            continue;
        }

        /* Get configuration information. */
        if ( config.is_standard_config() )
            pcfg = config.get_standard_config();
        if ( config.is_cluster_config() ) {
            pcfg = config.select_configuration(Users, U);
            config.show_selection(logfile);
        }

        /* Config file needs: (mpp-user part)
         * External Pruning:
         *   1. Reset support in movie-vote call: yes, no.
         *
         * Voting selection:
         *   2. Set vote_wt: 0 <= vote_wt <= 1
         *      (VOTE_wt = 1 - vote_wt)
         * Forcing in Range:
         *   5. Select 0, 1 or 2 force_vote_in_ranges:
         *      user-vote movie-VOTE
         */

        /* User voting.*/
        if ( pcfg->do_user_voting() )
            vote = user_vote(pcfg, M, supportM, U, supportU);
        //if ( vote < 1 ) vote = 1; else if ( vote > 5 ) vote = 5;

        /* Movie voting. */
        if ( pcfg->do_movie_voting() )
            VOTE = movie_vote(pcfg, M, supportM, U, supportU);
        //if ( VOTE < 1 ) VOTE = 1; else if ( VOTE > 5 ) VOTE = 5;

        /* Set user_vote_weight here. */
        vote_wt = pcfg->get_user_vote_weight();
        VOTE_wt = 1.0 - vote_wt;
        vote = (vote * vote_wt + VOTE * VOTE_wt ) / (vote_wt + VOTE_wt);

mpp-user.C2


        /* Earlier blending experiments, kept for reference. */
        //sumSCor=sumSCor/countdimMN; sumPCor=sumPCor/countdimMN; sumDCor=sumDCor/countdimMN; sumdimMN=sumdimMN/countdimMN;
        //sumsCor=sumsCor/countdimUV; sumpCor=sumpCor/countdimUV; sumdCor=sumdCor/countdimUV; sumdimUV=sumdimUV/countdimUV;
        // vote=(vote*sumdimUV + VOTE*sumdimMN)/(sumdimUV+sumdimMN);
        //auto double red=.4; vote=(vote*exp(-pow(Vsdp,2))+VOTE*exp(-red*pow(Nsdp,2)))/(exp(-pow(Vsdp,2))+exp(-red*pow(Nsdp,2)));
        //auto double red=1.0; vote=(vote*exp(-pow(Vsdp,2))+VOTE*red*exp(-pow(Nsdp,2)))/(exp(-pow(Vsdp,2))+red*exp(-pow(Nsdp,2)));
        // if ( sumsCor>sumSCor + 0.1 ){ vote=( vote*sumdimUV*(2+sumsCor)+VOTE*sumdimMN*(2+sumSCor) )/( sumdimUV*(2+sumsCor)+sumdimMN*(2+sumSCor)); }
        // vote=VOTE;
        // if ( Nsdp < 2.0 && Vsdp > 2.0 && sumSCor + .5 > sumsCor ) vote=VOTE;
        // if ( Nsdp < 0.5 && Vsdp > 2 ) vote=VOTE;
        // vote=(vote*exp(-pow(Vsdp,2) ) + VOTE*exp(-pow(Nsdp,2)))/(exp(-pow(Vsdp,2)) + exp(-pow(Nsdp,2))); //.937465(95)

        // Final output occurs here.
        // force vote into range
        if ( (vote < 1) && (vote != DEFAULT_VOTE) ) vote = 1;
        if ( (vote > 5) && (vote != DEFAULT_VOTE) ) vote = 5;

        fprintf(predictions, "%.2f\n", vote);
        fflush(predictions);

        /* The casts keep the arguments in step with the format specifiers. */
        if (logfile != NULL)
            fprintf(logfile, "\tPrediction #%u: %0.1f\tuser: %lu\tconfig: %s\n\n",
                    user, vote, (unsigned long) Users.get_identity(U),
                    pcfg->get_name());

    } // ULOOP end

    if (logfile!=NULL) {
        float total_time = time(NULL) - start_time;
        fprintf(logfile, "Ending movie: %lu\tTime: %.2f [%.2f/user] secs.\t",
                (unsigned long) Movies.get_identity(M), total_time,
                (float) (total_time/user_list.size()));
        /* log_memory passes a pointer difference, hence %td. */
        log_memory(logfile, "Memory: %td\n", movie_memory_start);
        fputs("\n", logfile);
        fclose(logfile);
    }

    fclose(predictions);

    return 0;
} // MLOOP end

mpp-user.C3
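The last steps of `Mpred_User_Predict` blend the user-based vote and the movie-based vote with weights `vote_wt` and `1 - vote_wt`, then force the result into the 1 to 5 star range unless it is still the default. The sketch below isolates those two steps. The names `blend_votes`, `force_in_range` and the sentinel value `kDefaultVote` are assumptions for illustration; the real `DEFAULT_VOTE` is defined in a header that is not shown in the listing.

```cpp
#include <cassert>

// Assumed sentinel meaning "no prediction could be made"; the actual
// DEFAULT_VOTE value is not visible in the listing.
const double kDefaultVote = 0.0;

// Weighted average of the two votes; user_wt plays the role of vote_wt.
double blend_votes(double user_vote, double movie_vote, double user_wt)
{
    assert(user_wt >= 0.0 && user_wt <= 1.0);
    const double movie_wt = 1.0 - user_wt;
    return (user_vote * user_wt + movie_vote * movie_wt) / (user_wt + movie_wt);
}

// Clamp into [1, 5], leaving the "no prediction" sentinel untouched.
double force_in_range(double vote)
{
    if (vote == kDefaultVote) return vote;
    if (vote < 1.0) return 1.0;
    if (vote > 5.0) return 5.0;
    return vote;
}
```

With `user_wt = 0.75`, a user vote of 4 and a movie vote of 2 blend to 3.5. Because the two weights always sum to 1, the division in `blend_votes` is a no-op; it is kept only to mirror the expression in the listing.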


/** \file
 * This file contains the implementation of the user voting function.
 */

/* Include files. */
#include <stdio.h>
#include <math.h>
#include <PTree.H>
#include "MovieSet.H"
#include "UserSet.H"
#include "mppConfig.H"
#include "PredictionConfig.H"
#include "mpp.h"

/* Config file needs: (user-vote part)
 *
 * uCor Internal Pruning:
 *   1.  Select 0 or 1 of dvCorp, dvCors, vdCorp, vdCors, pCor, dCor, sCor
 *   1.1 For selected in 1, set Threshold: dvThrp, dvThrs, vdThrp, vdThrs, pThr, dThr, sThr,
 *       Threshold defaults are: 0 0 0 0 0 0 0
 *
 * uCor vote weighting: (Default uCor=1. By selecting 1 of these, we reset uCor value to it.)
 *   2.  Select 0 or 1 of dvCorp, dvCors, vdCorp, vdCors, pCor, dCor, sCor
 *
 * Standard Deviation Internal Pruning: (population/sample; difference_of_vectors/vector_of_differences)
 *   3.  Select 0 or more of: dUVsdp, dUVsds, Vsdp_Usdp, Vsds_Usds
 *   3.1 Foreach selected in 3, set Threshold: dUVsdpThr, dUVsdsThr, Vsdp_UsdpThr, Vsds_UsdsThr
 *       Threshold defaults are: 0 0 0 0
 *   3.2 Foreach selected in 3, set pow exp: dUVsdpExp, dUVsdsExp, Vsdp_UsdpExp, Vsds_UsdsExp
 *       Power Exponent defaults are: -1 -1 -1 -1
 *
 * External Pruning:
 *   4.  Select 0 or more of: Prune_Movies_In_SupU, Prune_Users__In_SupM, Prune_Movies_In_CoSupUV
 *   4.1 Foreach selected in 4, select 1 of: Prune, FastPrune, CommonCoSupportPrune
 *   4.2 Reset non-pruned support in 2nd: yes, no.
 *   4.3 Foreach selected in 4, set parameter: mstrt, ustrt, TSa, TSb, Tdvp,Tdvs,Tvdp,Tvds,TD,TP,PPm,TV,TSD,Ch, Ct
 *       Prune Parameter defaults are: 0 0 -100 -100 -1 -1 -1 -1 -1 -1 .1 -1 -1 1 no def
 *
 * Forcing in Range:
 *   5.  Select 0 or more force_vote_in_range: in_Voter_LOOP after_Voter_LOOP before_return
 */

User-vote.C1
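The configuration notes distinguish population from sample statistics, and a "difference of vectors" quantity (comparing the two standard deviations) from a "vector of differences" quantity (the standard deviation of V - U). The listing computes all of them from running sums rather than from stored vectors. The sketch below reproduces those formulas on plain `std::vector` inputs; the struct `PairSums` and the function names are hypothetical, and only the arithmetic is taken from the listing.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Running sums over the co-rated dimensions of two rating vectors U and V.
struct PairSums { double dm, smU, smV, UU, UV, VV; };

PairSums accumulate_sums(const std::vector<double>& u, const std::vector<double>& v)
{
    assert(u.size() == v.size());
    PairSums s = {0, 0, 0, 0, 0, 0};
    for (std::size_t i = 0; i < u.size(); ++i) {
        s.dm  += 1;
        s.smU += u[i];        s.smV += v[i];
        s.UU  += u[i] * u[i]; s.UV  += u[i] * v[i]; s.VV += v[i] * v[i];
    }
    return s;
}

// Population standard deviation of one vector from its sums (Usdp, Vsdp).
double sd_pop(double dm, double sum, double sum_sq)
{
    return std::sqrt(dm * sum_sq - sum * sum) / dm;
}

// Population standard deviation of the differences V - U (dUVsdp).
double sd_of_differences_pop(const PairSums& s)
{
    const double dsSq = s.VV - 2 * s.UV + s.UU;   // sum of squared differences
    return std::sqrt(s.dm * dsSq - (s.smV - s.smU) * (s.smV - s.smU)) / s.dm;
}

// Sample standard deviation of the differences (dUVsds); requires dm > 1.
double sd_of_differences_sample(const PairSums& s)
{
    assert(s.dm > 1);
    const double dsSq = s.VV - 2 * s.UV + s.UU;
    const double Ub = s.smU / s.dm, Vb = s.smV / s.dm;
    return std::sqrt((dsSq - s.dm * (Vb - Ub) * (Vb - Ub)) / (s.dm - 1));
}
```

For U = (1, 2, 3) and V = (2, 2, 5) the differences are (1, 0, 2), so the sample standard deviation of the differences is 1 and the population value is sqrt(2/3). The difference of the two population standard deviations, `sd_pop` of V minus `sd_pop` of U, is a different number (about 0.60). One caveat of the running-sum form is that rounding can push the argument of `sqrt` slightly below zero for nearly constant vectors, so a production version would probably clamp it at zero.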


/**
 * Public function.
 * This function implements user voting.
 *
 * \param pcfg      A pointer to the class containing the parameters
 *                  which configure the voting.
 * \param M         The movie number for which a prediction is to be
 *                  made.
 * \param supportM  The PTree identifying the support for the movie
 *                  to be predicted.
 * \param U         The identity number of the user for which a
 *                  prediction is to be made.
 * \param supportU  The PTree identifying the support for the user
 *                  for whom a prediction is being made.
 * \return          The recommended prediction.
 */
extern double user_vote(PredictionConfig *pcfg, unsigned long int M,
                        PTree & supportM, unsigned long int U,
                        PTree & supportU)
{
    /* Enabled for boundary based prediction revisions. */
#if 0
    auto double z0IP55=0, z0IP44=0, z0IP33=0, z0IP22=0, z0IP11=0,
                z0IP15=0, z0IP14=0, z0IP13=0, z0IP12=0,
                z0IP51=0, z0IP41=0, z0IP31=0, z0IP21=0,
                z0IP25=0, z0IP24=0, z0IP23=0,
                z0IP52=0, z0IP42=0, z0IP32=0,
                z0IP35=0, z0IP34=0,
                z0IP53=0, z0IP43=0,
                z0IP45=0, z0IP54=0;
#endif

    auto double vote = DEFAULT_VOTE, vote_sum = 0, vote_cnt = 0;

    auto double Vb, Ub, dsSq, uCor = 1;

    struct pruning *internal_prune;
    struct external_prune *external_prune;

    auto PTree supM = supportM, supU = supportU;
    supM.clearbit(U);
    supU.clearbit(M);

    /* External pruning: PRUNE MOVIES supU */
    external_prune = pcfg->get_user_Prune_Movies_in_SupU();
    if ( external_prune->enabled ) {
        if( supU.get_count() > external_prune->params.Ct )
            do_pruning(external_prune, M, U, supM, supU);
        supM.clearbit(U);
        supU.clearbit(M);
        if( (supM.get_count() < 1) || (supU.get_count() < 1) )
            return vote;
    }

    /* Reset user support if requested. */
    if ( pcfg->reset_user_support() ) {
        supM = supportM;
        supM.clearbit(U);
    }

    /* External pruning: Prune Users supM */
    external_prune = pcfg->get_user_Prune_Users_in_SupM();
    if ( external_prune->enabled ) {
        if ( supM.get_count() > external_prune->params.Ct )
            do_pruning(external_prune, M, U, supM, supU);
        supM.clearbit(U);
        supU.clearbit(M);
        if( (supM.get_count() < 1) || (supU.get_count() < 1) )
            return vote;
    }

    /* VN: VLOOP strt (Vs are user voters)*/
    auto unsigned long long int *supMlist = supM.get_indexes();

    for (unsigned long long int v= 0; v < supM.get_count(); ++v) {
        auto unsigned long long int V = supMlist[v];

        auto double MV = Users.get_rating(V, M) - 2, max = 0,
                    smV = 0, smU = 0, UU = 0, UV = 0, VV = 0, dm;

User-vote.C2


User-vote.C3

        auto PTree csUV = supU & Users.get_movies(V);
        csUV.clearbit(M);
        dm = csUV.get_count();
        if( dm < 1) continue;

        /* turn on only if doing Inner-Product Boundary-Based prediction revisions */
#if 0
        auto double S1=0, S2=0, S3=0, S4=0, S5=0,
                    C1=0, C2=0, C3=0, C4=0, C5=0,
                    A1=0, A2=0, A3=0, A4=0, A5=0,
                    S11=0, S22=0, S33=0, S44=0, S55=0,
                    C11=0, C22=0, C33=0, C44=0, C55=0,
                    A11=0, A22=0, A33=0, A44=0, A55=0,
                    smN=0, smM=0, NN=0, MN=0, MM=0;
#endif

        /* External pruning: PRUNE MOVIES CoSupUV */
        external_prune = pcfg->get_user_Prune_Movies_in_CoSupUV();
        if ( external_prune->enabled ) {
            if( csUV.get_count() > external_prune->params.Ct )
                do_pruning(external_prune, M, U, supM, csUV);
            csUV.clearbit(M);
            supM.clearbit(U);
            dm = csUV.get_count();
            if( dm < 1 ) continue;
        }

        /* VN: NLOOP strt (Ns are movie vector_space_dimensions) */
        auto unsigned long long int *csUVlist = csUV.get_indexes();
        for (unsigned long long int n= 0; n < csUV.get_count(); ++n) {
            auto unsigned long long int N = csUVlist[n];
            auto double NU = Users.get_rating(U, N) - 2,
                        NV = Users.get_rating(V, N) - 2;
            if( pow(NU-NV, 2) > max) max = pow(NU-NV, 2);

            smV += NV;
            smU += NU;
            UU += NU * NU;
            UV += NU * NV;
            VV += NV * NV;

            //turn on only if doing Inner-Product Boundary-Based prediction revisions
#if 0
            if(NU==1&&NV>0){S1+=NV;++C1;}else{
            if(NU==2&&NV>0){S2+=NV;++C2;}else{
            if(NU==3&&NV>0){S3+=NV;++C3;}else{
            if(NU==4&&NV>0){S4+=NV;++C4;}else{
            if(NU==5&&NV>0){S5+=NV;++C5;} }}}}
#endif
        }   /* NLOOP end */

        Vb = smV / dm;
        Ub = smU / dm;
        dsSq = VV - 2*UV + UU;
        vote = MV - Vb + Ub;

        /* SAMPLE-statistic-based pruning through early exit. */
        if( dm > 1) {
            /* method dUVsds */
            internal_prune = pcfg->get_internal_prune(user_dUVsds);
            if ( internal_prune->enabled ) {
                auto double dUVsds, thr = internal_prune->threshold,
                            expnt = internal_prune->exponent;
                dUVsds = pow((dsSq-dm*(Vb-Ub)*(Vb-Ub))/(dm-1),.5);
                if( dUVsds > (thr * pow(dm, expnt)) ) continue;
            }

            /* method Vsds_Usds. NO exponent. */
            internal_prune = pcfg->get_internal_prune(user_Vsds_Usds);
            if ( internal_prune->enabled ) {
                auto double Usds, Vsds, thr=internal_prune->threshold;
                Usds = pow((UU-dm*Ub*Ub)/(dm-1), 0.5);
                Vsds = pow((VV-dm*Vb*Vb)/(dm-1), 0.5);
                if( Vsds > (thr * Usds) ) continue;
            }

            /* e.g., -10 is exponent. */
            /* e.g., 0 in if statement is threshold. */
            internal_prune = pcfg->get_internal_prune(user_dvCors);
            if ( internal_prune->enabled ) {
                auto double dvCors, Usds, Vsds,
                            thr = internal_prune->threshold,
                            expnt = internal_prune->exponent;
                Usds = pow((UU-dm*Ub*Ub)/(dm-1), 0.5);
                Vsds = pow((VV-dm*Vb*Vb)/(dm-1), 0.5);
                dvCors = exp(expnt * (Vsds-Usds)*(Vsds-Usds));
                if ( dvCors < thr ) continue;
                if ( internal_prune->weight ) uCor = dvCors;
            }

            internal_prune = pcfg->get_internal_prune(user_vdCors);
            if ( internal_prune->enabled ) {
                auto double vdCors, dUVsds,
                            thr = internal_prune->threshold,
                            expnt = internal_prune->exponent;
                dUVsds=pow((dsSq-dm*(Vb-Ub)*(Vb-Ub))/(dm-1),.5);
                vdCors = exp(expnt * dUVsds * dUVsds);
                if ( vdCors < thr ) continue;
                if ( internal_prune->weight ) uCor=vdCors;
            }
        }


User-vote.C4

        /* POPULATION-statistics-based pruning through early exit. */
        if( dm > 0 ) {
            internal_prune = pcfg->get_internal_prune(user_dUVsdp);
            if ( internal_prune->enabled ) {
                auto double dUVsdp, thr = internal_prune->threshold,
                            expnt = internal_prune->exponent;
                dUVsdp=pow(dm*dsSq-(smV-smU)*(smV-smU),.5)/dm;
                if ( dUVsdp > thr * pow(dm, expnt) ) continue;
            }

            /* method Vsdp_Usdp */
            // Usdp=pow(dm*UU-smU*smU,.5)/dm;
            // Vsdp=pow(dm*VV-smV*smV,.5)/dm;
            // if( Vsdp > 0.5 * Usdp )continue;
            // Threshold is 0.5
            // No exponent
            internal_prune = pcfg->get_internal_prune(user_Vsdp_Usdp);
            if ( internal_prune->enabled ) {
                auto double Usdp, Vsdp, thr = internal_prune->threshold;
                Usdp = pow(dm*UU - smU*smU, 0.5) / dm;
                Vsdp = pow(dm*VV - smV*smV, 0.5) / dm;
                if ( Vsdp > thr * Usdp ) continue;
            }

            // e.g., Threshold: 0.9
            // e.g., Exponent: -10
            // dvCorp=exp(-10 *(Vsdp-Usdp) * (Vsdp-Usdp));
            // if ( dvCorp < .9 ) continue;
            // uCor=dvCorp;
            internal_prune = pcfg->get_internal_prune(user_dvCorp);

            if ( internal_prune->enabled ) {
                auto double dvCorp, Usdp, Vsdp,
                            thr = internal_prune->threshold,
                            expnt = internal_prune->exponent;
                Usdp = pow(dm*UU - smU*smU, 0.5) / dm;
                Vsdp = pow(dm*VV - smV*smV, 0.5) / dm;
                dvCorp = exp(expnt * (Vsdp-Usdp)*(Vsdp-Usdp));
                if ( dvCorp < thr ) continue;
                if ( internal_prune->weight ) uCor = dvCorp;
            }

            internal_prune = pcfg->get_internal_prune(user_vdCorp);

            if ( internal_prune->enabled ) {
                auto double vdCorp, dUVsdp,
                            thr = internal_prune->threshold,
                            expnt = internal_prune->exponent;

                dUVsdp = pow(dm*dsSq-(smV-smU)*(smV-smU), .5) / dm;
                vdCorp = exp(expnt * dUVsdp * dUVsdp);

                if ( vdCorp < thr) continue;
                if ( internal_prune->weight ) uCor = vdCorp;
            }
        }

        /* OTHER Correlation pruning
         * (pearson=s, pureshift=p, distance=d) */
        internal_prune = pcfg->get_internal_prune(user_sCor);
        if ( internal_prune->enabled ) {
            auto double sCor, thr = internal_prune->threshold;

            sCor = (UV - dm*Ub*Vb)/(.0001 +
                   (pow((UU-dm*pow(Ub,2)),0.5))*
                   (.0001+pow((VV-dm*pow(Vb,2)),.5)));
            if ( sCor < thr ) continue;
            if ( internal_prune->weight ) uCor = sCor;
        }

        internal_prune = pcfg->get_internal_prune(user_pCor);
        if ( internal_prune->enabled ) {
            auto double OnePDS, pCor = -1,
                        thr = internal_prune->threshold,
                        expnt = internal_prune->exponent;

            OnePDS = dsSq - dm*pow(Vb-Ub, 2);
            if ( max > 0 )
                pCor=exp(expnt*OnePDS/(pow(max,.75)*pow(dm,.5)));
            if ( pCor < thr ) continue;
            if ( internal_prune->weight ) uCor = pCor;
        }


User-vote.C5

        internal_prune = pcfg->get_internal_prune(user_dCor);
        if ( internal_prune->enabled ) {
            auto double dCor, OnePDS, thr = internal_prune->threshold;

            OnePDS = dsSq - dm*pow(Vb-Ub, 2);
            dCor = exp(-dsSq / 100);
            if ( dCor < thr ) continue;
            if ( internal_prune->weight ) uCor = dCor;
        }

/* Turn on for boundary based prediction revisions. */
#if 0
if(C1>0&&C2+C3+C4+C5>0) {A1=S1/C1; A11=(S2+S3+S4+S5)/(C2+C3+C4+C5); z0IP11+=(A1-((A1+A11)/2))*(MV-((A1+A11)/2));}
if(C1>0&&C2>0) {A1=S1/C1; A2=S2/C2; z0IP12+=(A1-((A1+A2 )/2))*(MV-((A1+A2 )/2));}
if(C1>0&&C3>0) {A1=S1/C1; A3=S3/C3; z0IP13+=(A1-((A1+A3 )/2))*(MV-((A1+A3 )/2));}
if(C1>0&&C4>0) {A1=S1/C1; A4=S4/C4; z0IP14+=(A1-((A1+A4 )/2))*(MV-((A1+A4 )/2));}
if(C1>0&&C5>0) {A1=S1/C1; A5=S5/C5; z0IP15+=(A1-((A1+A5 )/2))*(MV-((A1+A5 )/2));}
z0IP51=-z0IP15; z0IP41=-z0IP14; z0IP31=-z0IP13; z0IP21=-z0IP12;

if(C2>0&&C1+C3+C4+C5>0) {A2=S2/C2; A22=(S1+S3+S4+S5)/(C1+C3+C4+C5); z0IP22+=(A2-((A2+A22)/2))*(MV-((A2+A22)/2));}
if(C2>0&& C3>0) {A2=S2/C2; A3=S3/C3; z0IP23+=(A2-((A2+A3 )/2))*(MV-((A2+A3 )/2));}
if(C2>0&& C4>0) {A2=S2/C2; A4=S4/C4; z0IP24+=(A2-((A2+A4 )/2))*(MV-((A2+A4 )/2));}
if(C2>0&& C5>0) {A2=S2/C2; A5=S5/C5; z0IP25+=(A2-((A2+A5 )/2))*(MV-((A2+A5 )/2));}
z0IP32=-z0IP23; z0IP42=-z0IP24; z0IP52=-z0IP25;

if(C3>0&&C1+C2+C4+C5>0) {A3=S3/C3; A33=(S1+S2+S4+S5)/(C1+C2+C4+C5); z0IP33+=(A3-((A3+A33)/2))*(MV-((A3+A33)/2));}
if(C3>0&& C4>0) {A3=S3/C3; A4=S4/C4; z0IP34+=(A3-((A3+A4 )/2))*(MV-((A3+A4 )/2));}
if(C3>0&& C5>0) {A3=S3/C3; A5=S5/C5; z0IP35+=(A3-((A3+A5 )/2))*(MV-((A3+A5 )/2));}
z0IP43=-z0IP34; z0IP53=-z0IP35;

if(C4>0&&C1+C2+C3+C5>0) {A4=S4/C4; A44=(S1+S2+S3+S5)/(C1+C2+C3+C5); z0IP44+=(A4-((A4+A44)/2))*(MV-((A4+A44)/2));}
if(C4>0&& C5>0) {A4=S4/C4; A5=S5/C5; z0IP45+=(A4-((A4+A5 )/2))*(MV-((A4+A5 )/2));}
z0IP54=-z0IP45;

if(C5>0&&C1+C2+C3+C4>0) {A5=S5/C5; A55=(S1+S2+S3+S4)/(C1+C2+C3+C4); z0IP55+=(A5-((A5+A55)/2))*(MV-((A5+A55)/2));}

//auto double MU = Users.get_rating(U,M)-2; fprintf(stderr,"MU=%1.0f %8.1f %8.1f %8.1f \n", MU,z0IP55,z0IP11,z0IP51);
//auto double MU = Users.get_rating(U,M)-2; fprintf(stderr,"MU=%1.0f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f\n",MU,z0IP11,z0IP22,z0IP33,z0IP44,z0IP55,z0IP12,z0IP13,z0IP14,z0IP15,z0IP23,z0IP24,z0IP25,z0IP34,z0IP35,z0IP45);
#endif


User-vote.C6

        /* Check and implement forcing of vote in the user loop.
         * The clamp is applied before the vote is accumulated; applied
         * afterwards it would have no effect on vote_sum. */
        if ( pcfg->user_vote_force_in_loop() ) {
            if( (vote < 1) && (vote != DEFAULT_VOTE) ) vote = 1;
            if( (vote > 5) && (vote != DEFAULT_VOTE) ) vote = 5;
        }

        if ( uCor > 0 ) {
            vote_sum += vote*uCor;
            vote_cnt += uCor;
        }
        else continue;
    }   /* VLOOP end */

    if ( vote_cnt > 0 ) vote = vote_sum / vote_cnt;
    else vote = DEFAULT_VOTE;

    /* force_vote_after_Voter_Loop goes here. */
    if ( pcfg->user_vote_force_after_loop() ) {
        if( (vote < 1) && (vote != DEFAULT_VOTE) ) vote=1;
        if( (vote > 5) && (vote != DEFAULT_VOTE) ) vote=5;
    }

/* Turn on only if doing Inner-Product Boundary-Based prediction revisions. */
#if 0
//Boundary-Based-Inner-Product vote CHANGE start
if ( z0IP55>-.01
     //&& z0IP55> z0IP33 && z0IP55> z0IP44
     && z0IP51>-.01
     //&& z0IP52> .1 && z0IP53> THRZ0 && z0IP54> THRZ0
   ) vote=5;
#endif
#if 0 //Boundary-Based-Inner-Product vote CHANGE start
    auto double FACZ0=-0.1, THRZ0=-0.1 ;
    //fauto double FACZ0= 0.40, THRZ0=0.7, z0IP51=-z0IP15, z0IP52=-z0IP25, z0IP53=-z0IP35, z0IP54=-z0IP54;
#if 1 //Change vote to 5?
if ( true && z0IP55> FACZ0 + z0IP11 && z0IP55> FACZ0+z0IP22
          && z0IP55> FACZ0+z0IP33 && z0IP55> FACZ0 + z0IP44
          && z0IP51> THRZ0 && z0IP52> THRZ0 && z0IP53> THRZ0 && z0IP54> THRZ0) vote=5;
#endif
#if 1 //Change vote to 1?
if ( true && z0IP11>(FACZ0 )*z0IP22 && z0IP11>(FACZ0 )*z0IP33
          && z0IP11>(FACZ0 )*z0IP44 && z0IP11>(FACZ0 )*z0IP55
          && z0IP12> THRZ0 && z0IP13> THRZ0 && z0IP14> THRZ0 && z0IP15> THRZ0 ) vote=1;
#endif
#endif //Boundary-Based-Inner-Product vote CHANGE end

    return vote;
}
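Stripped of pruning and PTree bookkeeping, `user_vote` works as follows: every user V who rated movie M casts the vote `MV - Vb + Ub`, where `Vb` and `Ub` are the means of V's and U's ratings over the movies both have rated (excluding M), and the votes are averaged with weight `uCor`. The sketch below shows that core with `std::map` standing in for the PTree support sets. The type `Ratings` and the function `user_vote_sketch` are hypothetical, raw ratings are used without the `- 2` offset that appears in the listing, and `uCor` is fixed at its default of 1.

```cpp
#include <cstddef>
#include <map>
#include <vector>

typedef std::map<int, double> Ratings;   // movie id -> rating

// Mean-offset vote of all users who rated movie M, for the target user.
double user_vote_sketch(const Ratings& target, const std::vector<Ratings>& voters,
                        int M, double default_vote)
{
    double vote_sum = 0, vote_cnt = 0;

    for (std::size_t v = 0; v < voters.size(); ++v) {
        const Ratings& V = voters[v];
        Ratings::const_iterator mv = V.find(M);
        if (mv == V.end()) continue;             // V did not rate M: not a voter

        // Sums over the co-support: movies rated by both users, except M.
        double smU = 0, smV = 0, dm = 0;
        for (Ratings::const_iterator it = target.begin(); it != target.end(); ++it) {
            if (it->first == M) continue;
            Ratings::const_iterator nv = V.find(it->first);
            if (nv == V.end()) continue;
            smU += it->second;
            smV += nv->second;
            dm  += 1;
        }
        if (dm < 1) continue;                    // nothing in common: skip voter

        const double uCor = 1.0;                 // default weight, no correlation
        const double vote = mv->second - smV / dm + smU / dm;
        vote_sum += vote * uCor;
        vote_cnt += uCor;
    }
    return vote_cnt > 0 ? vote_sum / vote_cnt : default_vote;
}
```

As a worked case, let the target user have rated movies 1 and 2 with 4 and 2. A voter with ratings {1: 5, 2: 3, 9: 4} contributes 4 - 4 + 3 = 3 for movie 9, a voter with {1: 3, 9: 5} contributes 5 - 3 + 4 = 6, and the prediction is their average, 4.5. That the second vote exceeds 5 is the reason the listing offers the optional in-range forcing steps.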


/** \file
 * This file contains the implementation of the movie voting algorithm.
 */

/* Include files. */
#include <stdio.h>
#include <math.h>       /* pow() and exp() are used below */
#include <PTree.H>
#include "MovieSet.H"
#include "UserSet.H"
#include "mppConfig.H"
#include "PredictionConfig.H"
#include "mpp.h"

/* Config file needs: (movie-vote part)
 *
 * UCor Internal Pruning:
 *   1.  Select 0 or 1 of DVCorp, DVCors, VDCorp, VDCors, PCor, DCor, SCor
 *   1.1 For selected in 1, set Threshold: DVThrp, DVThrs, VDThrp, VDThrs, PThr, DThr, SThr
 *       Threshold defaults are: 0 0 0 0 0 0 0
 *
 * UCor VOTE weighting: (Default is UCor=1. By selecting 1 of these, we reset UCor's value to it.)
 *   2.  Select 0 or 1 of DVCorp, DVCors, VDCorp, VDCors, PCor, DCor, SCor
 *
 * Standard Deviation Internal Pruning: (population/sample; difference_of_vectors/vector_of_differences)
 *   3.  Select 0 or more of: dMNsdp, dMNsds, Nsdp_Msdp, Nsds_Msds
 *   3.1 Foreach selected in 3, set Threshold: dMNsdpThr, dMNsdsThr, Nsdp_MsdpThr, Nsds_MsdsThr
 *       Threshold defaults are: 0 0 0 0
 *   3.2 Foreach selected in 3, set pow exp: dMNsdpExp, dMNsdsExp, Nsdp_MsdpExp, Nsds_MsdsExp
 *       Power Exponent defaults are: -1 -1 -1 -1
 *
 * External Pruning:
 *   4.  Select 0 or more of: Prune_Users_In_SupM, Prune_Movies_In_SupU, Prune_Users_In_CoSupMN
 *   4.1 Foreach selected in 4, select 1 of: Prune, FastPrune, CommonCoSupportPrune
 *   4.2 Reset non-pruned support in 2nd: yes, no.
 *   4.3 Foreach selected in 4, set parameter: mstrt, ustrt, TSa, TSb, Tdvp,Tdvs,Tvdp,Tvds,TD,TP,PPm,TV,TSD,Ch, Ct
 *       Prune Parameter defaults are: 0 0 -100 -100 -1 -1 -1 -1 -1 -1 .1 -1 -1 1 no def
 *
 * Forcing in Range:
 *   5.  Select 0,1 or 2 force_vote_in_ranges: in_Voter_LOOP(for each voter) outside_Voter_LOOP (for composite VOTE)
 */

/**
 * Public function.
 * This function implements movie voting.
 *
 * \param pcfg      A pointer to the class containing the parameters
 *                  which configure the voting.
 * \param M         The movie number for which a prediction is to be made.
 * \param supportM  The PTree identifying the support for the movie to be predicted.
 * \param U         The identity number of the user for which a prediction is to be made.
 * \param supportU  The PTree identifying the support for the user for whom a prediction is being made.
 * \return          The recommended prediction.
 */

movie-vote.C1


movie-vote.C2

extern double movie_vote(PredictionConfig *pcfg, unsigned long int M,
                         PTree & supportM, unsigned long int U,
                         PTree & supportU)
{
    auto double vote = DEFAULT_VOTE, VOTE = DEFAULT_VOTE,
                VOTE_sum = 0, VOTE_cnt = 0;
    auto double Nb, Mb, dsSq, UCor = 1;
    struct pruning *internal_prune;
    struct external_prune *external_prune;

    auto PTree supM = supportM, supU = supportU;
    supM.clearbit(U);
    supU.clearbit(M);

    /* External pruning: Prune Users supM */
    external_prune = pcfg->get_movie_Prune_Users_in_SupM();
    if ( external_prune->enabled ) {
        if( supM.get_count() > external_prune->params.Ct)
            do_pruning(external_prune, M, U, supM, supU);
        supM.clearbit(U);
        supU.clearbit(M);
        if ( (supM.get_count() < 1) || (supU.get_count() < 1) )
            return vote;
    }

    /* Reset support if requested. */
    if ( pcfg->reset_movie_support() ) {
        supU = supportU;
        supU.clearbit(M);
    }

    /* External pruning: Prune Movies supU */
    external_prune = pcfg->get_movie_Prune_Movies_in_SupU();
    if ( external_prune->enabled ) {
        if( supU.get_count() > external_prune->params.Ct )
            do_pruning(external_prune, M, U, supM, supU);
        supM.clearbit(U);
        supU.clearbit(M);
        if( (supM.get_count() < 1) || (supU.get_count() < 1) )
            return vote;
    }

    /* NV: NLOOP strt (Ns are movie voters) */
    auto unsigned long long int *supUlist = supU.get_indexes();
    for (unsigned long long int nn= 0; nn < supU.get_count(); ++nn) {
        auto unsigned long long int N = supUlist[nn];

        auto double NU = Users.get_rating(U,N)-2, MAX = 0,
                    smN = 0, smM = 0, MM = 0, MN = 0, NN = 0, dm;

        auto PTree csMN = supM & Movies.get_users(N);
        csMN.clearbit(U);
        dm = csMN.get_count();
        if( dm < 1 ) continue;

        /* External pruning: PRUNE USERS CoSupMN */
        external_prune = pcfg->get_movie_Prune_Users_in_CoSupMN();
        if ( external_prune->enabled ) {
            if( csMN.get_count() > external_prune->params.Ct)
                do_pruning(external_prune, M, U, csMN, supU);
            csMN.clearbit(U);
            supU.clearbit(M);
            dm = csMN.get_count();
            if( dm < 1) continue;
        }

        /* NV: VLOOP strt (Vs are user vector_space_dimensions) */
        auto unsigned long long int *csMNlist = csMN.get_indexes();

        for (unsigned long long int v= 0; v < csMN.get_count(); ++v) {
            auto unsigned long long int V = csMNlist[v];

            auto double MV = Users.get_rating(V,M) - 2,
                        NV = Users.get_rating(V,N) - 2;

            if( pow(MV-NV, 2) > MAX ) MAX = pow(MV-NV, 2);
            smN += NV;
            smM += MV;
            MM += MV * MV;
            MN += NV * MV;
            NN += NV * NV;
        }

        Nb = smN / dm;
        Mb = smM / dm;
        dsSq = NN - 2*MN + MM;
        VOTE = NU - Nb + Mb;


movie-vote.C3

        /* force_vote_in_Voter_Loop goes here. */
        if ( pcfg->movie_vote_force_in_loop() ) {
            if ( (VOTE < 1) && (VOTE != DEFAULT_VOTE) ) VOTE=1;
            if ( (VOTE > 5) && (VOTE != DEFAULT_VOTE) ) VOTE=5;
        }

        /* SAMPLE-statistic-based pruning through early exit. */
        if( dm > 1 ) {
            /* method dMNsds */
            internal_prune = pcfg->get_internal_prune(movie_dMNsds);
            if ( internal_prune->enabled ) {
                auto double dMNsds, thr = internal_prune->threshold,
                            expnt = internal_prune->exponent;

                dMNsds = pow((dsSq-dm*(Nb-Mb)*(Nb-Mb))/(dm-1), 0.5);
                if( dMNsds > (thr * pow(dm, expnt)) ) continue;
            }

            /* method Nsds_Msds NO exponent. */
            internal_prune = pcfg->get_internal_prune(movie_Nsds_Msds);
            if ( internal_prune->enabled ) {
                auto double Msds, Nsds, thr = internal_prune->threshold;
                Msds = pow((MM-dm*Mb*Mb)/(dm-1), 0.5);
                Nsds = pow((NN-dm*Nb*Nb)/(dm-1), 0.5);
                if ( Nsds > (thr * Msds) ) continue;
            }

            internal_prune = pcfg->get_internal_prune(movie_DVCors);
            if ( internal_prune->enabled ) {
                auto double Msds, Nsds, DVCors,
                            thr = internal_prune->threshold,
                            expnt = internal_prune->exponent;

                /* Sample standard deviations (dm-1 divisor), matching the
                 * dvCors computation in user_vote. */
                Msds = pow((MM-dm*Mb*Mb)/(dm-1), 0.5);
                Nsds = pow((NN-dm*Nb*Nb)/(dm-1), 0.5);
                DVCors = exp(expnt * (Nsds-Msds)*(Nsds-Msds));
                if ( DVCors < thr ) continue;
                if ( internal_prune->weight ) UCor = DVCors;
            }

            internal_prune = pcfg->get_internal_prune(movie_VDCors);
            if ( internal_prune->enabled ) {
                auto double VDCors, dMNsds,
                            thr = internal_prune->threshold,
                            expnt = internal_prune->exponent;

                dMNsds=pow((dsSq-dm*(Nb-Mb)*(Nb-Mb))/(dm-1),.5);

                VDCors = exp(expnt * dMNsds * dMNsds);
                if ( VDCors < thr ) continue;
                if ( internal_prune->weight ) UCor = VDCors;
            }
        }

        /* POPULATION-statistics-based pruning through early exit. */
        if ( dm > 0 ) {
            internal_prune = pcfg->get_internal_prune(movie_dMNsdp);
            if ( internal_prune->enabled ) {
                auto double dMNsdp,thr=internal_prune->threshold;

                dMNsdp=pow(dm*dsSq-(smN-smM)*(smN-smM),.5)/dm;
                if ( dMNsdp > (thr * pow(dm,0.9)) ) continue;
            }

            /* method Nsdp_Msdp */
            internal_prune = pcfg->get_internal_prune(movie_Nsdp_Msdp);
            if ( internal_prune->enabled ) {
                auto double Nsdp, Msdp, thr = internal_prune->threshold;
                Msdp = pow(dm*MM - smM*smM, 0.5) / dm;
                Nsdp = pow(dm*NN - smN*smN, 0.5) / dm;
                if( Nsdp > (thr * Msdp) ) continue;
            }

            /* The slide looks up movie_VDCorp here although the block
             * computes DVCorp; movie_DVCorp is assumed to be the intended
             * selector, by analogy with movie_DVCors. */
            internal_prune = pcfg->get_internal_prune(movie_DVCorp);
            if ( internal_prune->enabled ) {
                auto double DVCorp, Msdp, Nsdp,
                            thr = internal_prune->threshold,
                            expnt = internal_prune->exponent;
                Msdp = pow(dm*MM - smM*smM, 0.5) / dm;
                Nsdp = pow(dm*NN - smN*smN, 0.5) / dm;
                DVCorp = exp(expnt * (Nsdp-Msdp)*(Nsdp-Msdp));
                if ( DVCorp < thr ) continue;
                if ( internal_prune->weight ) UCor = DVCorp;
            }

movie-vote.C4

            internal_prune = pcfg->get_internal_prune(movie_VDCorp);
            if ( internal_prune->enabled ) {
                auto double VDCorp, dMNsdp,
                            thr = internal_prune->threshold,
                            expnt = internal_prune->exponent;

                dMNsdp=pow(dm*dsSq-(smN-smM)*(smN-smM),.5)/dm;
                VDCorp = exp(expnt * dMNsdp * dMNsdp);
                if ( VDCorp < thr ) continue;
                if ( internal_prune->weight ) UCor = VDCorp;
            }
        }

        /* OTHER Correlation pruning (pearson=s,pureshift=p,distance=d)*/
        internal_prune = pcfg->get_internal_prune(movie_SCor);
        if ( internal_prune->enabled ) {
            auto double SCor, thr=internal_prune->threshold;
            SCor= (MN-dm*Mb*Nb)/(.0001+(pow((MM-dm*pow(Mb,2)),.5)) *
                  (.0001+pow((NN-dm*pow(Nb, 2)),.5)));
            if ( SCor < thr ) continue;
            if ( internal_prune->weight ) UCor = SCor;
        }

        /* CHECK for exponent */
        internal_prune = pcfg->get_internal_prune(movie_PCor);
        if ( internal_prune->enabled ) {
            auto double ONEPDS, PCor = 1, thr = internal_prune->threshold;
            ONEPDS = dsSq - dm * pow(Nb-Mb, 2);
            if (MAX>0) PCor=exp(-.1*ONEPDS/(pow(MAX,.75)*pow(dm,.5)));
            if( PCor < thr ) continue;
            if ( internal_prune->weight ) UCor = PCor;
        }

        internal_prune = pcfg->get_internal_prune(movie_DCor);
        if ( internal_prune->enabled ) {
            auto double DCor, ONEPDS, thr = internal_prune->threshold;
            ONEPDS = dsSq - dm*pow(Nb-Mb, 2);
            DCor = exp(-dsSq / 100);
            if ( DCor < thr ) continue;
            if ( internal_prune->weight ) UCor = DCor;
        }

        if (UCor>0) { VOTE_sum += VOTE*UCor; VOTE_cnt+=UCor; }
        else continue;

        /* force_vote_in_Voter_Loop goes here. */
        if ( pcfg->movie_vote_force_in_loop() ) {
            if ( (VOTE < 1) && (VOTE != DEFAULT_VOTE) ) VOTE=1;
            if ( (VOTE > 5) && (VOTE != DEFAULT_VOTE) ) VOTE=5;
        }

    }

    if ( VOTE_cnt > 0 ) VOTE = VOTE_sum / VOTE_cnt;
    else VOTE = DEFAULT_VOTE;

    /* force_vote_after_Voter_Loop goes here. */
    if ( pcfg->movie_vote_force_after_loop() ) {
        if ( (VOTE < 1) && (VOTE != DEFAULT_VOTE) ) VOTE=1;
        if ( (VOTE > 5) && (VOTE != DEFAULT_VOTE) ) VOTE=5;
    }

    return VOTE;
}
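Both voting functions compute their Pearson-style weight (`sCor` in user voting, `SCor` in movie voting) from running sums, with a small constant added to the denominator so that constant rating vectors do not divide by zero. The sketch below restates that formula for two plain vectors. The function name `pearson_from_sums` is an assumption, and the clamp at zero before the square root is an added safeguard that the listing does not have.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation of m and n from running sums, using the same
// regularized denominator as the SCor expression in the listing.
double pearson_from_sums(const std::vector<double>& m, const std::vector<double>& n)
{
    assert(m.size() == n.size() && !m.empty());
    double dm = 0, smM = 0, smN = 0, MM = 0, MN = 0, NN = 0;
    for (std::size_t i = 0; i < m.size(); ++i) {
        dm  += 1;
        smM += m[i];        smN += n[i];
        MM  += m[i] * m[i]; MN  += m[i] * n[i]; NN += n[i] * n[i];
    }
    const double Mb = smM / dm, Nb = smN / dm;
    const double eps = 0.0001;

    // Added guard: rounding can make these slightly negative.
    const double ssM = std::max(0.0, MM - dm * Mb * Mb);
    const double ssN = std::max(0.0, NN - dm * Nb * Nb);

    return (MN - dm * Mb * Nb) / (eps + std::sqrt(ssM) * (eps + std::sqrt(ssN)));
}
```

Because of the added constants the result stays strictly inside (-1, 1): perfectly correlated inputs such as (1, 2, 3) and (2, 4, 6) give a value just under 1, and a constant vector gives exactly 0 instead of a division by zero.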


Prune.C1

/** \file
 * This file contains implementations of routines
 * for pruning user and movie voting lists.
 */

/* Standard C++ include files. */
#include <functional>   /* greater<>, used by map_t */
#include <map>
#include <vector>
#include <math.h>       /* pow() and exp() are used by the pruning routines */
#include <unistd.h>
#include <stdlib.h>

/* Local C++ include files. */
#include <PTree.H>
#include "UserSet.H"
#include "MovieSet.H"
#include "mppConfig.H"
#include "mpp.h"

/* Global accessible variables. */
extern float corData[17771];

using namespace std;

/* Shorthand type definition for the correlation map. */
typedef multimap<double, unsigned long long int, greater<double > > map_t;

/* Private function.
 *
 * This function loads a vector with a list of support indexes from
 * the given PTree. The list contains N elements where N is the support
 * count. The actual order of the list is determined by the start and
 * multiplier values passed in from the caller.
 *
 * \param suptree  A reference to PTree whose support list is to be generated.
 * \param list     A reference to vector loaded with support indexes.
 * \param start    The starting element in the support list which
 *                 will be 0th element in the completed support list.
 * \param mult     The multiplier value to be used in determining
 *                 the support starting point.
 */
static void load_support_vector(PTree & suptree,
                                vector<unsigned long long int> & list,
                                unsigned long long int start, double mult)
{
    auto unsigned long long int *indexes = suptree.get_indexes(),
                                supcnt = suptree.get_count();

    /* Set the starting point based on the specified start point
     * and a multiplier if it is specified. If the starting point
     * exceeds the support count start at the beginning of the
     * support list. */
    start = start + (unsigned long long int) (mult * supcnt);
    if ( start > supcnt )
        start = (unsigned long long int) (mult * supcnt);
    if ( start > supcnt ) start = 0;

    /* The simple case is a start of zero. Return here; falling through
     * to the loops below would load every index a second time. */
    if ( start == 0 ) {
        for (unsigned long long int lp= 0; lp < supcnt; ++lp)
            list.push_back(indexes[lp]);
        return;
    }

    /* Two loop passes are needed for a non-zero start value. */
    for (unsigned long long int lp= start; lp < supcnt; ++lp)
        list.push_back(indexes[lp]);

    for (unsigned long long int lp= 0; lp < start; ++lp)
        list.push_back(indexes[lp]);
    return;
}

/* Private function.
 * This function checks whether a voting entity falls outside a
 * selection window. A selection window is defined by a minimum (leftside)
 * voter index and a window size.
 *
 * \param voter  The voter being considered.
 * \param pp     A pointer to the structure containing the
 *               leftside and width parameters for a pruning method.
 * \return       True if the voter lies outside the selection window,
 *               false if it lies inside. A false value is
 *               automatically returned if the width value
 *               is set to zero. Setting the width value to
 *               zero thus disables window based selection.
 */
static bool outside_window(unsigned long long int voter,
                           struct pruning_parameters *pp)
{
    if ( pp->width == 0 ) return false;
    if ( voter < pp->leftside ) return true;
    if ( voter > pp->leftside + pp->width ) return true;
    return false;
}
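`load_support_vector` above produces the support indexes in rotated order: it starts at `start + mult * supcnt` (falling back towards 0 when that exceeds the support count), runs to the end of the list, then wraps around to the beginning. The same ordering can be expressed with `std::rotate_copy`, as in the sketch below. The function `rotated_support` is hypothetical and takes a plain vector in place of a PTree.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Returns `indexes` rotated so that the element at the computed start
// position comes first; each index appears exactly once.
std::vector<unsigned long long>
rotated_support(const std::vector<unsigned long long>& indexes,
                unsigned long long start, double mult)
{
    const unsigned long long supcnt = indexes.size();

    // Same start-point rules as the listing.
    start = start + (unsigned long long) (mult * supcnt);
    if (start > supcnt) start = (unsigned long long) (mult * supcnt);
    if (start >= supcnt) start = 0;

    std::vector<unsigned long long> out;
    out.reserve(indexes.size());
    std::rotate_copy(indexes.begin(),
                     indexes.begin() + static_cast<std::ptrdiff_t>(start),
                     indexes.end(),
                     std::back_inserter(out));
    return out;
}
```

For the list (10, 20, 30, 40) with `start = 1` and `mult = 0.25`, the effective start is 1 + 1 = 2 and the result is (30, 40, 10, 20). A start of zero returns the list unchanged.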


Prune.C2

/* Private function.
 * This function implements the final step in 'pruning' of a PTree. It
 * clears the destination PTree and then sets only those bits in the PTree
 * which have been selected by a previous correlation strategy.
 *
 * \param tree       A reference to the PTree which is to reflect the
 *                   contents of the multimap.
 * \param index_map  The map specifying the index bits to be set.
 * \param max_count  Maximum number of indexes to be selected from PTree.
 */
static void load_ptree(PTree & tree, const map_t & index_map, double max_count)
{
    map_t::const_iterator index_ptr = index_map.begin();

    if ( index_map.size() < max_count ) max_count = index_map.size();

    tree.clearall();
    for (unsigned int lp= 0; lp < max_count; ++lp) {
        tree.setbit(index_ptr->second);
        ++index_ptr;
    }
    return;
}
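`load_ptree` relies on `map_t` being a `multimap` ordered with `greater<double>`, so iterating from `begin()` visits the best-correlated indexes first and stopping after `max_count` entries keeps the top ones. The sketch below shows that selection on its own, returning the kept indexes in a vector where the listing sets bits in a PTree. The names `cor_map_t` and `top_k` are illustrative.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Correlation -> index, highest correlation first (same shape as map_t).
typedef std::multimap<double, unsigned long long, std::greater<double> > cor_map_t;

// Keep the indexes of the k highest-scoring entries.
std::vector<unsigned long long>
top_k(const std::vector<std::pair<double, unsigned long long> >& scored, std::size_t k)
{
    cor_map_t by_cor;
    for (std::size_t i = 0; i < scored.size(); ++i)
        by_cor.insert(cor_map_t::value_type(scored[i].first, scored[i].second));

    std::vector<unsigned long long> kept;
    for (cor_map_t::const_iterator it = by_cor.begin();
         it != by_cor.end() && kept.size() < k; ++it)
        kept.push_back(it->second);
    return kept;
}
```

A `multimap` is used instead of a `map` so that two voters with the same correlation are both retained. If `k` exceeds the number of entries, everything is kept, mirroring the `max_count` adjustment in `load_ptree`.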

/* Movie prune standard. */
/* movie_vote: Prune */
static void mPrune(unsigned long long int M, PTree & supM, PTree & supU,
                   struct pruning_parameters *pp)
{
    if ( supU.get_count() < (pp->Ct + 1) ) return;
    map_t corRm;
    auto vector<unsigned long long int> support;

    /* moviePRUNE (NV loops) NLOOP start */
    load_support_vector(supU, support, pp->mstrt, pp->mstrt_mult);
    for (unsigned long long int lp= 0; lp < support.size(); lp++) {
        auto unsigned long long int N = support[lp];

        if ( outside_window(N, pp) ) continue;
        auto double smM = 0, smN = 0, MM = 0, NN = 0, MN = 0, MV, NV, max=0,
            dm, Mb, Nb, dsSq, OnePDS,
            Nsdp = 0, Msdp = 0, Nsds = 0, Msds = 0, dMNsdp = 0, dMNsds = 0,
            mCor = 1, sCor = 1, dCor = 1, pCor = 1, vCor = 1, stdCor = 1,
            dvCorp = 1, dvCors = 1, vdCorp = 1, vdCors = 1;

        auto PTree csMN = supM&Movies.get_users(N);
        if( csMN.get_count() < 1 ) continue;

        /* moviePRUNE (NV loops) VLOOP start */
        auto vector<unsigned long long int> ilp;
        load_support_vector(csMN, ilp, pp->ustrt, pp->ustrt_mult);
        for (unsigned long long int lp1= 0; lp1 < ilp.size(); ++lp1) {
            auto unsigned long long int V = ilp[lp1];
#if 0
            if ( outside_window(V, pp) ) continue;
#endif
            MV = Movies.get_rating(V, M) - 2;
            NV = Movies.get_rating(V, N) - 2;
            if(pow(MV-NV,2)>max) max=pow(MV-NV,2);
            smM += MV;
            smN += NV;
            MM += MV*MV;
            NN += NV*NV;
            MN += MV*NV;
        }

        dm = csMN.get_count();
        Mb = smM/dm;
        Nb = smN/dm;
        dsSq = NN-2*MN+MM;
        OnePDS = dsSq-dm*pow(Nb-Mb,2);
        sCor = (MN-dm*Mb*Nb)/(.0001+ (pow((MM-dm*pow(Mb,2)),.5))*
                                     (pow((NN-dm*pow(Nb,2)),.5)));
        dCor = exp(-dsSq/100);
        pCor = 1;
        if(max>0) pCor=exp(-pp->PPm*OnePDS/(.0001+pow(max,.75)*pow(dm,.5)));
        if(dm>0){
            Nsdp=pow(dm*NN-smN*smN,.5)/dm;
            Msdp=pow(dm*MM-smM*smM,.5)/dm;
            dMNsdp=pow(dm*dsSq-(smN-smM)*(smN-smM),.5)/dm;
        }
        if(dm>1){
            Nsds=pow((NN-dm*Nb*Nb)/(dm-1),.5);
            Msds=pow((MM-dm*Mb*Mb)/(dm-1),.5);
            dMNsds=pow((dsSq-dm*(Nb-Mb)*(Nb-Mb))/(dm-1),.5);
        }
        dvCorp=exp(-10 * (Nsdp-Msdp) * (Nsdp-Msdp) );
        dvCors=exp(-10 * (Nsds-Msds) * (Nsds-Msds) );
        vdCorp=exp(-10 * dMNsdp * dMNsdp );
        vdCors=exp(-10 * dMNsds * dMNsds );

        if( pp->Ch == 1 ) mCor = corData[N+1];
        if( pp->Ch == 2 ) mCor = sCor;
        if( pp->Ch == 3 ) mCor = dCor;
        if( pp->Ch == 4 ) mCor = pCor;
        if( pp->Ch == 5 ) mCor = vCor;
        if( pp->Ch == 6 ) mCor = stdCor;
        if( pp->Ch == 7 ) mCor = dvCorp;
        if( pp->Ch == 8 ) mCor = dvCors;
        if( pp->Ch == 9 ) mCor = vdCorp;
        if( pp->Ch == 0 ) mCor = vdCors;

        // THRESHOLD PRUNING
        if ( corData[N+1] < pp->TSa || sCor < pp->TSb ||
             pCor < pp->TP || dCor < pp->TD || vCor < pp->TV ||
             stdCor < pp->TSD || dvCorp < pp->Tdvp ||
             dvCors < pp->Tdvs || vdCorp < pp->Tvdp || vdCors < pp->Tvds )
            continue;       /* candidate N fails a threshold: skip it */
        else {
            auto pair<double,unsigned long long int> entry(mCor,N);
            corRm.insert(entry);
        }
    }
    if ( corRm.size() == 0 ) return;
    load_ptree(supU, corRm, pp->Ct);
    return;
}

/* movie_vote: FastPrune */
static void fmPruneS(PTree & supU, struct pruning_parameters *pp)
{
    if ( supU.get_count() < pp->Ct + 1 ) return;
    map_t corRm;
    auto vector<unsigned long long int> support;

    /* moviePRUNE (NV loops) NLOOP start */
    load_support_vector(supU, support, pp->mstrt, pp->mstrt_mult);
    for (unsigned long long int lp= 0; lp < support.size(); lp++) {
        auto unsigned long long int N = support[lp];
#if 0
        if ( outside_window(N, pp) ) continue;
#endif
        if( corData[N+1] < pp->TSa ) continue;
        auto pair<double, unsigned long long int> entry(corData[N+1], N);
        corRm.insert(entry);
    }
    if ( corRm.size() == 0 ) return;
    load_ptree(supU, corRm, pp->Ct);
    return;
}

//userPRUNE (VN loops) start
/* user_vote: Prune */
static void uPrune (unsigned long long int U, PTree & supM, PTree & supU,
                    struct pruning_parameters *pp)
{
    if ( supM.get_count() < pp->Ct + 1) return;
    map_t corR;
    auto vector<unsigned long long int> support;

    /* userPrune (VN loops) VLOOP start */
    load_support_vector(supM, support, pp->ustrt, pp->ustrt_mult);
    for (unsigned long long int lp= 0; lp < support.size(); lp++) {
        auto unsigned long long int V = support[lp];
        if ( outside_window(V, pp) ) continue;

        auto double smU=0, smV=0, UU=0, VV=0, UV=0, max=0,
            Vsdp=0, Usdp=0, Vsds=0, Usds=0, dUVsdp=0, dUVsds=0,
            mCor=1, sCor=1, dCor=1, pCor=1, vCor=1, stdCor=1,
            dvCorp=1, dvCors=1, vdCorp=1, vdCors=1,
            NU, NV, dm, Ub, Vb, dsSq, OnePDS;
        auto PTree csUV = supU & Users.get_movies(V);
        if( csUV.get_count() < 1 ) continue;

        /* user PRUNE (VN loops) NLOOP start */
        auto vector<unsigned long long int> ilp;
        load_support_vector(csUV, ilp, pp->mstrt, pp->mstrt_mult);
        for (unsigned long long int lp1= 0; lp1 < ilp.size(); ++lp1) {
            auto unsigned long long int N = ilp[lp1];
#if 0
            if ( outside_window(N, pp) ) continue;
#endif
            NU = Movies.get_rating(U, N) - 2;
            NV = Movies.get_rating(V, N) - 2;
            if ( pow(NU-NV,2) > max ) max=pow(NU-NV, 2);
            smU += NU;
            smV += NV;
            UU += NU*NU;
            VV += NV*NV;
            UV += NU*NV;
        }
        //user PRUNE (VN loops) NLOOP end

        dm = csUV.get_count();
        Ub = smU/dm;
        Vb = smV/dm;
        dsSq = VV - 2*UV + UU;
        OnePDS = dsSq - dm*pow(Vb-Ub,2);

        /* The .0001 term guards against a zero denominator when either
         * user's ratings are constant on the co-support, as in mPrune. */
        sCor=(UV-dm*Ub*Vb)/(.0001+(pow((UU-dm*pow(Ub,2)),.5))*
                                  (pow((VV-dm*pow(Vb,2)),.5)));
        dCor = exp(-dsSq/100);
        if (max>0) pCor=exp(-pp->PPm*OnePDS/(pow(max,.75)*pow(dm,.5)));
        if(dm>0){
            Vsdp=pow(dm*VV-smV*smV,.5)/dm;
            Usdp=pow(dm*UU-smU*smU,.5)/dm;
            dUVsdp=pow(dm*dsSq-(smV-smU)*(smV-smU),.5)/dm;
        }
        if(dm>1){
            Vsds=pow((VV-dm*Vb*Vb)/(dm-1),.5);
            Usds=pow((UU-dm*Ub*Ub)/(dm-1),.5);
            dUVsds=pow((dsSq-dm*(Vb-Ub)*(Vb-Ub))/(dm-1),.5);
        }
        dvCorp=exp(-10 * (Vsdp-Usdp) * (Vsdp-Usdp) );
        dvCors=exp(-10 * (Vsds-Usds) * (Vsds-Usds) );
        vdCorp=exp(-10 * dUVsdp * dUVsdp );
        vdCors=exp(-10 * dUVsds * dUVsds );

        if( pp->Ch == 1 ) mCor = sCor;
        if( pp->Ch == 2 ) mCor = sCor;
        if( pp->Ch == 3 ) mCor = dCor;
        if( pp->Ch == 4 ) mCor = pCor;
        if( pp->Ch == 5 ) mCor = vCor;
        if( pp->Ch == 6 ) mCor = stdCor;
        if( pp->Ch == 7 ) mCor = dvCorp;
        if( pp->Ch == 8 ) mCor = dvCors;
        if( pp->Ch == 9 ) mCor = vdCorp;
        if( pp->Ch == 0 ) mCor = vdCors;

        // THRESHOLD PRUNE
        if ( sCor < pp->TSb || pCor < pp->TP || dCor < pp->TD ||
             vCor < pp->TV || stdCor < pp->TSD || dvCorp < pp->Tdvp ||
             dvCors < pp->Tdvs || vdCorp < pp->Tvdp || vdCors < pp->Tvds )
            continue;
        else {
            auto pair<double,unsigned long long int> entry(mCor,V);
            corR.insert(entry);
        }
    }
    if ( corR.size() == 0 ) return;
    load_ptree(supM, corR, pp->Ct);
    return;
}

/* user_vote: FastPrune */
static void fuPruneS(unsigned long long int U, PTree & supM, PTree & supU,
                     struct pruning_parameters *pp)
{
    if ( supM.get_count() < (pp->Ct + 1) ) return;
    map_t corR;
    auto vector<unsigned long long int> support;

    load_support_vector(supM, support, pp->ustrt, pp->ustrt_mult);

    for (unsigned long long int lp= 0; lp < support.size(); lp++) {
        auto unsigned long long int V = support[lp];
        if ( outside_window(V, pp) ) continue;
        auto PTree csUV = supU & Users.get_movies(V);
        if ( csUV.get_count() < 1 ) continue;
        auto double smU = 0, smV = 0, UU = 0, VV = 0, UV = 0, NU, NV;

        /* fast user Prune (VN loops) NLOOP start */
        auto vector<unsigned long long int> ilp;
        load_support_vector(csUV, ilp, pp->mstrt, pp->mstrt_mult);

        for(unsigned long long int lp1= 0; lp1 < ilp.size(); ++lp1) {
            auto unsigned long long int N = ilp[lp1];
#if 0
            if ( outside_window(N, pp) ) continue;
#endif
            NU = Movies.get_rating(U, N) - 2;
            NV = Movies.get_rating(V, N) - 2;
            smU += NU;
            smV += NV;
            UU += NU * NU;
            VV += NV * NV;
            UV += NU * NV;
        }

        auto double dm = csUV.get_count(),
            Ub = smU / dm,
            Vb = smV / dm,
            SCor=(UV-dm*Ub*Vb)/(.00001+(pow((UU-dm*pow(Ub,2)),.5))*
                                       (pow((VV-dm*pow(Vb,2)),.5)));
        if( SCor < pp->TSb ) continue;
        auto pair<double,unsigned long long int> entry(SCor,V);
        corR.insert(entry);
    }
    if ( corR.size() == 0 ) return;
    load_ptree(supM, corR, pp->Ct);
    return;
}
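The SCor expression above is a sample (Pearson) correlation computed from running sums over the co-supported movies. The sketch below isolates that computation as a rough illustration. The function name pearson_from_sums and the use of plain vectors are assumptions for illustration; the variable names and the small denominator guard follow the listing. Clamping the variance terms at zero is an addition of this sketch, not something the original does.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: Pearson correlation of two rating vectors using
// the running sums smU, smV, UU, VV, UV, as in fuPruneS.
double pearson_from_sums(const std::vector<double>& u,
                         const std::vector<double>& v,
                         double eps = 1e-5)
{
    const std::size_t n = std::min(u.size(), v.size());
    if (n == 0) return 0.0;

    double smU = 0, smV = 0, UU = 0, VV = 0, UV = 0;
    for (std::size_t i = 0; i < n; ++i) {
        smU += u[i];        smV += v[i];
        UU  += u[i] * u[i]; VV  += v[i] * v[i];
        UV  += u[i] * v[i];
    }
    const double dm = static_cast<double>(n);
    const double Ub = smU / dm, Vb = smV / dm;

    // Sum-of-squares terms; clamped because rounding can push them
    // slightly below zero (sketch-only safeguard).
    const double ssU = std::max(UU - dm * Ub * Ub, 0.0);
    const double ssV = std::max(VV - dm * Vb * Vb, 0.0);

    // eps keeps the quotient finite when one side has constant ratings.
    return (UV - dm * Ub * Vb) / (eps + std::sqrt(ssU) * std::sqrt(ssV));
}
```

Because the correlation is invariant under shifting either vector by a constant, subtracting 2 from every rating (as the listing does) does not change the result.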

/* user_vote: CommonCoSupportPrune */
static void uPrune2(PTree & supM, PTree & supU, struct pruning_parameters *pp)
{
    if ( supM.get_count() < pp->Ct+1) return;
    map_t corR;
    auto PTree csUV;
    auto vector<unsigned long long int> support;

    /* CommonCoSup userPRUNE VN loops VLOOP start */
    load_support_vector(supM, support, pp->ustrt, pp->ustrt_mult);
    for (unsigned long long int lp= 0; lp < support.size(); lp++) {
        auto unsigned long long int V = support[lp];

        if ( outside_window(V, pp) ) continue;

        csUV = supU & Users.get_movies(V);
        auto double dm = csUV.get_count();
        auto pair<double, unsigned long long int> entry(dm, V);
        corR.insert(entry);
    }

    auto unsigned int select_count = (unsigned int) pp->Ct;
    auto PTree ccsU = supU;
    map_t::iterator begin = corR.begin();
    supM.clearall();
    if ( corR.size() < pp->Ct ) select_count = corR.size();

    for(unsigned int lp= 0; lp < select_count; ++lp) {
        supM.setbit(begin->second);
        ccsU = ccsU & Users.get_movies(begin->second);
        ++begin;
    }
    supU = ccsU;
    return;
}
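The common co-support idea in uPrune2 can be sketched without the PTree library: rank candidate voters by how many movies they share with the target user, keep the best few, and shrink the movie set to what all kept voters have rated. In this sketch std::vector<bool> stands in for a PTree, the names bitmask, bit_and, bit_count and cosupport_prune are hypothetical, and the ranking is assumed to be largest overlap first (the ordering of map_t is not shown in the slides).

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

typedef std::vector<bool> bitmask;   // stand-in for a PTree

// Bitwise AND of two masks (truncated to the shorter length).
bitmask bit_and(const bitmask& a, const bitmask& b)
{
    bitmask out(a.size() < b.size() ? a.size() : b.size());
    for (std::size_t i = 0; i < out.size(); ++i) out[i] = a[i] && b[i];
    return out;
}

// Number of set bits in a mask.
std::size_t bit_count(const bitmask& m)
{
    std::size_t c = 0;
    for (std::size_t i = 0; i < m.size(); ++i) if (m[i]) ++c;
    return c;
}

// Keep the `keep` voters with the largest overlap with supU and reduce
// supU to the movies rated by every kept voter. Returns kept voter ids.
std::vector<std::size_t> cosupport_prune(
        const std::vector<bitmask>& voter_movies, bitmask& supU,
        std::size_t keep)
{
    std::multimap<std::size_t, std::size_t,
                  std::greater<std::size_t> > ranked;
    for (std::size_t v = 0; v < voter_movies.size(); ++v)
        ranked.insert(std::make_pair(
            bit_count(bit_and(supU, voter_movies[v])), v));

    std::vector<std::size_t> kept;
    bitmask common = supU;
    std::multimap<std::size_t, std::size_t,
                  std::greater<std::size_t> >::const_iterator it =
        ranked.begin();
    for (std::size_t i = 0; i < keep && it != ranked.end(); ++i, ++it) {
        kept.push_back(it->second);
        common = bit_and(common, voter_movies[it->second]);
    }
    supU = common;
    return kept;
}
```

The same shape applies to mPrune2 with the roles of users and movies exchanged.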


/* movie_voting: CommonCoSupportPrune */
static void mPrune2(PTree & supM, PTree & supU, struct pruning_parameters *pp)
{
    if ( supU.get_count() < (pp->Ct + 1) ) return;
    map_t corRm;
    auto PTree csMN;
    auto vector<unsigned long long int> support;

    /* moviePRUNE NV loops NLOOP start */
    load_support_vector(supU, support, pp->mstrt, pp->mstrt_mult);

    for (unsigned long long int lp= 0; lp < support.size(); lp++) {
        auto unsigned long long int N = support[lp];
        if ( outside_window(N, pp) ) continue;
        csMN = supM & Movies.get_users(N);
        auto double dm = csMN.get_count();
        auto pair<double, unsigned long long int> entry(dm, N);
        corRm.insert(entry);
    }

    auto unsigned int select_count = (unsigned int) pp->Ct;
    auto PTree ccsM = supM;
    map_t::iterator begin = corRm.begin();
    supU.clearall();
    if ( corRm.size() < select_count ) select_count = corRm.size();
    for(unsigned int lp= 0; lp < select_count; ++lp) {
        supU.setbit(begin->second);
        ccsM = ccsM & Movies.get_users(begin->second);
        ++begin;
    }
    supM = ccsM;
    return;
}

/* Internal function.
 * This function dispatches execution to the pruning method which has
 * been selected for an external pruning routine.
 * \param prune  A pointer to the structure defining the
 *               external pruning to be conducted.
 * \param M      The movie whose rating is to be predicted.
 * \param U      The user for whom the prediction is to be made.
 * \param supM   A PTree describing the movie support.
 * \param supU   A PTree describing user support.
 */
void do_pruning(struct external_prune * const prune, unsigned long int M,
                unsigned long int U, PTree & supM, PTree & supU)
{

    auto struct pruning_parameters *params = &prune->params;

    switch ( prune->method ) {
        case UserPrune:
            uPrune(U, supM, supU, params);
            break;
        case UserFastPrune:
            fuPruneS(U, supM, supU, params);
            break;
        case UserCommonCoSupportPrune:
            uPrune2(supM, supU, params);
            break;

        case MoviePrune:
            mPrune(M, supM, supU, params);
            break;
        case MovieFastPrune:
            fmPruneS(supU, params);
            break;
        case MovieCommonCoSupportPrune:
            mPrune2(supM, supU, params);
            break;
    }

    return;
}


cd Output
../mpp-glue1 ../$1
cd ..
mpp-rmse1 ./$1

Run script for processing the per-movie .predict files into one .predictions file (and also one .rmse file and one .out log file).

mpp-glue script

#! /bin/bash
# This utility 'glues' a set of .predict files for a given run
# of mpp-mpred into a single file. This program is driven
# by the input file used for the prediction run. When it finds
# a movie (delimited by a trailing :) ALL entries in the file
# InputFileName_movieID.predict in the current directory
# are appended to the file InputFileName.txt.predictions.
# The utility takes as its single argument the InputFileName
# used for the prediction run.

# Verify input file is found.
if [ -z "$1" ]; then
    echo "Error: Input file not specified.";
    exit 1;
fi;
if [ ! -e "$1" ]; then
    echo "Error: Input file not found - >$1<";
    exit 1;
fi;

# Variables global to this module.
declare -r Name=`basename $1`;
declare -r Output="$Name.predictions" Logfile="$Name.logfile";
declare -r Backup="$Name.backup";
declare Inputfile=$1;
declare Movie;
declare Predictions Log;
declare Current_Dir;

# Main body of the program occurs here.
# If a directory named Output is present assume
# we should use that directory.
if [ -d "./Output" ]; then
    Current_Dir=`pwd`;
    Inputfile="../$Inputfile";
    cd Output;
fi;

# Remove any old output files and make sure we have a fresh backup directory.
rm -f $Output $Logfile;
if [ -d "$Backup" ]; then
    echo "Error: Backup directory present.";
    exit 1;
fi;
mkdir $Backup;

# Loop over prediction input file and generate outputs.
cat $Inputfile | while read input;
do
    if [ "$input" != "${input%%:}" ]; then
        Movie=${input%%:};
        Predictions="$Name"_$Movie.predict;
        Log="$Name"_$Movie.log;
        if [ ! -e "$Predictions" ]; then
            echo "Error: Prediction file not found - " \
                 ">$Predictions<";
            exit 1;
        fi;
        echo "Processing: $Movie";
        cat $Predictions >>$Output;
#       [ -e "$Log" ] && cat $Log >>$Logfile;
        rm $Predictions;
# The previous line was added. With the following commented out,
# it seems to eliminate backing up.
#       mv $Predictions $Backup;
#       if [ $? -ne 0 ]; then
#           echo "Error: Unable to create predictions backup.";
#           exit 1;
#       fi;
#       if [ -e "$Log" ]; then
#           mv $Log $Backup;
#           if [ $? -ne 0 ]; then
#               echo "Error: Unable to create logs backup.";
#               exit 1;
#           fi;
#       fi;
    fi;
done;

# All done.
echo -e "\nInputfile: $Inputfile";
echo -e "\tPredictions:\t$Output";
echo -e "\tLogfile:\t$Logfile";
echo -e "\tBackups:\t$Backup";
echo -e "\nLine count verifications:";
echo -e "\t$(wc -l $Inputfile)";
echo -e "\t$(wc -l $Output)";
[ -n "$Current_Dir" ] && cd ..;
exit 0

Takes as input all files
    InputFileName_movieID1.predict … InputFileName_movieIDn.predict
in the current directory (deleted after processing).

Puts as output (in the current directory):
    InputFileName.txt.predictions


mpp-rmse1 script

#! /bin/bash
# This utility generates an RMSE report based on predictions carried
# out on the 'probe' dataset. It compares a prediction list against
# the set of known answers.
# This program is driven by the input file used for the prediction
# run. The majority of the comparative work and generation of the
# RMSE values is done by the PERL script called from this script.
# The PERL script reads both the prediction file
# (Output/InputFileName.txt.predictions) and the list of known answers
# (InputFileName.txt.answers in the current directory).
# When a movie is found it verifies the movie is
# also present in the companion file. This is to ensure there are
# no discrepancies between the two files.
# The utility takes as a single argument the name of the input file
# used for the prediction run.

# Verify input file is found.
if [ -z "$1" ]; then
    echo "Error: Input file not specified.";
    exit 1;
fi;
if [ ! -e "$1" ]; then
    echo "Error: Input file not found - >$1<";
    exit 1;
fi;

# Variables global to this module.
declare -r Startdir=`dirname $0`;
declare -r Basename=`basename $1`;
declare -r Answers="$1.answers";
declare -r Predictions="Output/$Basename.predictions";

if [ ! -e "$Answers" ]; then
    echo "Answers file not found - >$Answers<.";
    exit 1;
fi;
if [ ! -e "$Predictions" ]; then
    echo "Predictions file not found - >$Predictions<.";
    exit 1;
fi;

# Main body of the program occurs here.
perl $Startdir/mpp-rmse.pl $Answers $Predictions | tee "$Basename.rmse";
exit 0

mpp-rmse1.pl

$answers = $ARGV[0];
$predictions = $ARGV[1];
$lp = 0; $cnt = 0; $error = 0; $error_sum = 0;
$total_error = 0; $total_cnt = 0; $last_movie = "";

chomp(@answers = `cat $answers`);
chomp(@predictions = `cat $predictions`);

foreach(@answers) {
    if ( /:$/ ) {
        if ( $last_movie ne "" ) {
            printf "\n\tSum: %.5f\tTotal: %-5d\tRMSE: %f\n\n",
                $error_sum, $cnt, sqrt($error_sum/$cnt);
            printf "\tRunning RMSE: %f / %d predictions\n\n",
                sqrt($total_error/$total_cnt), $total_cnt;
            $error_sum = 0;
            $cnt = 0;
        }
        $last_movie = $_;
        print "Movie: $_\n";
        if ( $_ ne $predictions[$lp] ) {
            print "Movies don't match\n";
            print "\t$_ vs. $predictions[$lp]\n";
            exit 1;
        }
        ++$lp;
        next;
    }

    # Correct for an NAN
    if ( $predictions[$lp] eq "nan" ) {
        print "NAN";
        $predictions[$lp] = "3.70";
    }
    if ( $predictions[$lp] eq "corm-nan" ) {
        print "CORM-NAN";
        $predictions[$lp] = "3.70";
    }

    $error = ($_ - $predictions[$lp])**2;
    $error_sum += $error;
    $total_error += $error;
    ++$total_cnt;
    ++$cnt;
    printf "\t%4d:\tAnswer: %2d\tPrediction: $predictions[$lp]\tError: %.5f\n",
        $cnt -1, $_, $error;
    ++$lp;
}

# Print the RMSE from the last movie.
printf "\n\tSum: %.5f\tTotal: %-5d\tRMSE: %f\n\n",
    $error_sum, $cnt, sqrt($error_sum/$cnt);

# Then the total RMSE for the run.
print "Prediction summary:\n";
printf "\tSum: %.5f\tTotal: %-5d\tRMSE: %f\n\n",
    $total_error, $total_cnt, sqrt($total_error/$total_cnt);
exit 0;
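The core of the RMSE report is small enough to sketch on its own. The following is an illustrative C++ rendering, not a port of the Perl script: the function name rmse is hypothetical, inputs are already-parsed numbers instead of file lines, and a NaN prediction is replaced by 3.70 to mirror the script's "nan" substitution.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Root-mean-square error between known answers and predictions.
// A NaN prediction is replaced by `fallback`, mirroring the 3.70
// substitution in the Perl script.
double rmse(const std::vector<double>& answers,
            const std::vector<double>& predictions,
            double fallback = 3.70)
{
    const std::size_t n = answers.size() < predictions.size()
                        ? answers.size() : predictions.size();
    if (n == 0) return 0.0;          // sketch-only choice for empty input

    double error_sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double p = predictions[i];
        if (std::isnan(p)) p = fallback;
        const double d = answers[i] - p;
        error_sum += d * d;
    }
    return std::sqrt(error_sum / static_cast<double>(n));
}
```

Calling it once per movie gives the per-movie figures, and calling it on the concatenated lists gives the run total.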

mpp-rmse

Takes as input, from the current directory:
    Output/InputFileName.txt.predictions
    InputFileName.txt.answers

Puts as output (in the current directory):
    InputFileName.txt.rmse


mpp-user-reduce script

#! /bin/bash

# Variables global to this module.
declare -r Pgm=`basename $0`;
declare Mode="both";

# This utility reduces a set of movies to be predicted by outputting
# movies which have an RMSE value greater than a specified threshold.
# This program is driven by the input file used for the prediction
# run. The majority of the comparative work and generation of the
# RMSE values is done by the PERL script called from this script.

# If the first argument to the utility is a -m the next argument
# is interpreted as a mode value. The following arguments are accepted:
#       low:    Output only low RMSE pairings.
#       high:   Output only high RMSE pairings.
#       both:   Output both files.
# The default is for both files to be output.
if [ "$1" = "-m" ]; then
    case $2 in
        low)    Mode="low";;
        high)   Mode="high";;
        both)   Mode="both";;
        *)      echo -e "$Pgm: Unknown argument to mode switch, \c";
                echo "specify low, high or both.";
                exit 1;;
    esac;
    shift 2;
fi;

# The utility takes four general arguments as follows:
#
#       $1:     Inputfile
#       $2:     RMSE threshold value.
#       $3:     Root name of output file for movies below threshold.
#       $4:     Root name of output file for movies above threshold.

# Verify input file is found.
if [ -z "$1" -o ! -e "$1" ]; then
    echo "$Pgm: Error - Input file not specified.";
    echo
    echo "Command format:"
    echo -e "\t$Pgm [-m low|high|both] Inputfile Threshold \c";
    echo -e "LowOutFile HighOutfile";
    exit 1;
fi;
if [ -z "$2" ]; then
    echo "$Pgm: Error - RMSE threshold not specified.";
    exit 1;
fi;
if [ -z "$3" ]; then
    echo "$Pgm: Error - Low output filename not specified.";
    exit 1;
fi;
if [ -z "$4" ]; then
    echo "$Pgm: Error - High output filename not specified.";
    exit 1;
fi;

# Variables global to this module which are dependent on command-line options.
declare -r Input=$1;
declare -r Startdir=`dirname $0`;
declare -r Basename=`basename $1`;
declare -r Answers="$1.answers";
declare -r Predictions="Output/$Basename.predictions";
declare -r Threshold=$2;
declare -r LowOut=$3;
declare -r HighOut=$4;

if [ ! -e "$Answers" ]; then
    echo "$Pgm: Error - Answers file not found: >$Answers<.";
    exit 1;
fi;
if [ ! -e "$Predictions" ]; then
    echo "$Pgm - Predictions file not found: >$Predictions<.";
    exit 1;
fi;

# Main body of the program occurs here.
perl -w $Startdir/mpp-user-reduce.pl $Input $Answers $Predictions $Threshold \
    $LowOut $HighOut $Mode;
exit 0

Usage: mpp-user-reduce -m both|low|high InputFile.txt SqErrThrhld LowOutFile HighOutFile

Example: mpp-user-reduce -m both Data/probe19.txt .0001 lo19 hi19

Takes as input:
    Data/probe19.txt (movieID with interleaved userIDs format or .txt format)
    SqErrThrhld (if SqErr ≤ .0001, put the pair in lo19.txt, else put it in hi19.txt)
    -m both means both lo and hi will be produced (other options: low or high)

Puts as output:
    lo-FileName
    hi-FileName


mpp-user-reduce.pl

$Input = $ARGV[0];
$Answers = $ARGV[1];
$Predictions = $ARGV[2];
$Threshold = $ARGV[3];
$LowOut = $ARGV[4];
$HighOut = $ARGV[5];
$Mode = $ARGV[6];
$Low_Count = 0;
$High_Count = 0;

# Subroutine outputs pairing results for a given collection of user/movie ratings;
sub Output_Pairing
{
    my($file, $rmse_ptr) = @_;
    my($inputfile, $answerfile, $user, $answer);

    # Open input and answer files.
    $inputfile = $file . ".txt";
    print "\t\tInput: $inputfile\n";
    open(NEWINPUT, ">$inputfile") ||
        die "Cannot open new inputfile: $inputfile";

    $answerfile = $file . ".txt.answers";
    print "\t\tAnswers: $answerfile\n";
    open(ANSWERS, ">$answerfile") ||
        die "Cannot open new answer file: $answerfile.";

    # The outer loop runs over the movies in a grouping. The inner
    # loop then runs over the set of inputs and answers for that movie.
    foreach ( keys(%{$rmse_ptr}) ) {
        print NEWINPUT "$_\n";
        print ANSWERS "$_\n";

        foreach ( @{$$rmse_ptr{$_}} ) {
            ($user, $answer) = split;
            print NEWINPUT "$user\n";
            print ANSWERS "$answer\n";
        }
    }
    close(NEWINPUT);
    close(ANSWERS);
    return;
}

# Main program starts here.
# Load input, answers and predictions into arrays which are stored in
# hashes keyed by movie number.
open(INPUT, $Input) || die "Cannot open input: $Input";
while ( <INPUT> ) {
    chomp;
    if (/:$/) { $key = $_; $Input{$key}=[]; }
    else { push(@{$Input{$key}}, $_); }
}
close(INPUT);

open(INPUT, $Answers) || die "Cannot open answer file: $Answers";
while ( <INPUT> ) {
    chomp;
    if (/:$/) { $key = $_; $Answers{$key}=[]; }
    else { push(@{$Answers{$key}}, $_); }
}
close(INPUT);

open(INPUT, $Predictions) || die "Cannot open predictions file: $Predictions";
while ( <INPUT> ) {
    chomp;
    if ( /:$/ ) { $key = $_; $Predictions{$key} = []; }
    else { push(@{$Predictions{$key}}, $_); }
}
close(INPUT);

foreach( keys(%Answers) ) {
    my $lp;
    my $error;
    $movie = $_;
    @users = @{$Input{$movie}};
    @ans = @{$Answers{$movie}};
    @pred = @{$Predictions{$movie}};

    for ($lp= 0; $lp <= $#ans; ++$lp) {
        $user = $users[$lp];
        $predict = $pred[$lp];

        # Correct for NAN's and CORM-NAN
        if ($pred[$lp] eq "nan") { print "NAN"; $predict="3.70"; }
        if ($pred[$lp] eq "corm-nan") { print "CORM-NAN"; $predict="3.70"; }
        $error = ($ans[$lp] - $predict)**2;
        if ( $error > $Threshold ) {
            $HighRMSE{$movie} = [] if !defined($HighRMSE{$movie});
            push(@{$HighRMSE{$movie}},"$user $ans[$lp]");
            ++$High_Count;
        }
        else {
            $LowRMSE{$movie} = [] if !defined($LowRMSE{$movie});
            push(@{$LowRMSE{$movie}},"$user $ans[$lp]");
            ++$Low_Count;
        }
    }
}

# Output new input and predictions files based on the reduced set.
print "Selected movie/user pairings based on RMSE = $Threshold:\n";
if ( ($Mode eq "low") or ($Mode eq "both") ) {
    print "\tLow rmse pairs: ", $Low_Count, "\n";
    Output_Pairing($LowOut, \%LowRMSE);
    print "\n";
}
if ( ($Mode eq "high") or ($Mode eq "both") ) {
    print "\tHigh rmse pairs: ", $High_Count, "\n";
    Output_Pairing($HighOut, \%HighRMSE);
}

# All done.
exit 0;
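The partitioning rule applied by mpp-user-reduce.pl is simple: a (user, movie) pair goes to the high group when its squared error exceeds the threshold, and to the low group otherwise. A minimal C++ sketch of that rule follows. The struct Split and the function split_by_squared_error are hypothetical names, and the inputs are assumed to be already-parsed numbers for a single movie.

```cpp
#include <cstddef>
#include <vector>

// Positions of the pairs falling below/at and above the threshold.
struct Split {
    std::vector<std::size_t> low;
    std::vector<std::size_t> high;
};

// Partition prediction positions by squared error against `threshold`:
// error > threshold goes to `high`, everything else to `low`.
Split split_by_squared_error(const std::vector<double>& answers,
                             const std::vector<double>& predictions,
                             double threshold)
{
    Split s;
    const std::size_t n = answers.size() < predictions.size()
                        ? answers.size() : predictions.size();
    for (std::size_t i = 0; i < n; ++i) {
        const double d = answers[i] - predictions[i];
        if (d * d > threshold) s.high.push_back(i);
        else                   s.low.push_back(i);
    }
    return s;
}
```

With a very small threshold such as .0001, the low group keeps only the pairs that were predicted almost exactly.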


mpp-filter script (for unioning (-M or) or intersecting (-M and) clusters, to check coverage, etc.)

#! /bin/bash
# This is a driver program for implementing a utility for ANDing or
# ORing two input files.

# Variables global to this module.
declare -r Pgm=`basename $0`;
declare Mode="";

# Parse arguments.
while getopts "M:" Arg;
do
    case $Arg in
        M)  Mode=$OPTARG;;
    esac;
done;

# Sanity checks.
if [ -z "$Mode" ]; then
    echo "$Pgm: No mode specified.";
    exit 1;
fi;
if [ "$Mode" != "and" -a "$Mode" != "or" ]; then
    echo "$Pgm: Invalid mode specified - $Mode";
    exit 1;
fi;

# Verify two filenames are present.
shift `expr $OPTIND - 1`;
if [ $# -ne 2 ]; then
    echo "$Pgm: Insufficient filenames specified.";
    exit 1;
fi;

# Call Perl to carry out the boolean filtering operation.
exec perl $Pgm.pl $Mode $*;

mpp-filter.pl

# This script implements boolean filtering operations between two input
# files. The results of the filtering operation are output on stdout.
# Two merge modes are supported:
#   AND: A user index is output if it exists for a given movie in both input files.
#   OR:  A movie/user pair is output if it exists in either input file.
$Mode = $ARGV[0];
$Input1 = $ARGV[1];
$Input2 = $ARGV[2];

# The following subroutine loads a file into an associative array. The
# filename to be read is passed to the subroutine as the first argument.
# A reference to the associative array is passed as the second argument.
# If the filename cannot be opened an error exit is taken from the application.
sub Load_File
{
    my ($key, $file, $hptr) = (undef, $_[0], $_[1]);

    open(IN, $file) || die "Cannot open file: $file";
    while ( $_ = <IN> ) {
        chomp;
        if ( /:$/ ) { $key = $_; $$hptr{$key} = []; }
        else { push(@{$$hptr{$key}}, $_); }
    }
    close(IN);
    return;
}

# Subroutine outputs a file which has been stored in hashed/array format.
sub Output_File
{
    foreach ( keys(%{$_[0]}) ) {
        print "$_\n";

        my @hlist = @{$_[0]{$_}};
        foreach ( @hlist ) { print "$_\n"; }
    }
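The main body of mpp-filter.pl is not shown in the slides, so the sketch below is based only on the AND/OR semantics stated in its header comment. It is a hypothetical C++ rendering: the type pair_table and the functions filter_and and filter_or are illustrative names, and each file is assumed to be already loaded as a map from a movie key (such as "1:") to the set of user indexes listed under it.

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <set>
#include <string>

// movie key (e.g. "1:") -> set of user indexes listed under that movie
typedef std::map<std::string, std::set<unsigned long> > pair_table;

// AND: keep a user for a movie only if it appears in both tables.
pair_table filter_and(const pair_table& a, const pair_table& b)
{
    pair_table out;
    for (pair_table::const_iterator it = a.begin(); it != a.end(); ++it) {
        pair_table::const_iterator other = b.find(it->first);
        if (other == b.end()) continue;
        std::set<unsigned long> common;
        std::set_intersection(it->second.begin(), it->second.end(),
                              other->second.begin(), other->second.end(),
                              std::inserter(common, common.begin()));
        // Assumption: movies with no shared users are omitted.
        if (!common.empty()) out[it->first] = common;
    }
    return out;
}

// OR: keep every movie/user pair that appears in either table.
pair_table filter_or(const pair_table& a, const pair_table& b)
{
    pair_table out = a;
    for (pair_table::const_iterator it = b.begin(); it != b.end(); ++it)
        out[it->first].insert(it->second.begin(), it->second.end());
    return out;
}
```

Using sets removes duplicate users and sorts the output; whether the original preserves input order is not visible from the listing.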


Appendix 1: additional codes

Directories

$ ls -l
-rwxr-xr-x 1 perrizo faculty 259 Nov 29 11:22 cluster-corr
-rw-r--r-- 1 perrizo faculty 1.2K Nov 29 11:22 cluster-corr.pl
-rwxr-xr-x 1 perrizo faculty 7.7K Feb 1 12:26 config
-rw-r--r-- 1 perrizo faculty 14K Nov 29 11:22 config.c
-rw-r--r-- 1 perrizo faculty 1.4K Nov 29 11:22 config.h
-rw-r--r-- 1 perrizo faculty 5.6K Nov 29 11:25 config.o
-rw-r--r-- 1 perrizo faculty 38K Nov 29 11:25 config-parser.c
-rw-r--r-- 1 perrizo faculty 806 Nov 29 11:22 config-parser.l
-rw-r--r-- 1 perrizo faculty 15K Nov 29 11:25 config-parser.o
-rw-r--r-- 1 perrizo faculty 2.4K Nov 29 11:22 cosupport.C
drwxr-xr-x 2 perrizo faculty 12K Feb 2 12:40 Data
drwxr-xr-x 2 perrizo faculty 4.0K Nov 29 11:25 libPTree
-rw-r--r-- 1 perrizo faculty 4.1K Nov 29 11:22 Makefile
-rwxr-xr-x 1 perrizo faculty 16K Nov 29 11:25 movie-corr
-rw-r--r-- 1 perrizo faculty 1.3K Nov 29 11:22 movie-corr.C
-rw-r--r-- 1 perrizo faculty 2.3K Nov 29 11:22 MovieCorrelation.C
-rw-r--r-- 1 perrizo faculty 1.4K Nov 29 11:22 MovieCorrelation.H
-rw-r--r-- 1 perrizo faculty 9.6K Nov 29 11:25 MovieCorrelation.o
-rw-r--r-- 1 perrizo faculty 3.3K Nov 29 11:25 movie-corr.o
-rw-r--r-- 1 perrizo faculty 1.5K Nov 29 11:22 movie-rating.C
-rw-r--r-- 1 perrizo faculty 1.5K Nov 29 11:22 movie-set.C
-rw-r--r-- 1 perrizo faculty 2.0K Nov 29 11:22 MovieSet.C
-rw-r--r-- 1 perrizo faculty 1.1K Nov 29 11:22 MovieSet.H
-rw-r--r-- 1 perrizo faculty 4.2K Nov 29 11:25 MovieSet.o
-rw-r--r-- 1 perrizo faculty 14K Jan 19 07:07 movie-vote.C
-rw-r--r-- 1 perrizo faculty 9.7K Jan 19 07:07 movie-vote.o
-rwxr-xr-x 1 perrizo faculty 303 Nov 29 11:22 mpp
-rwxr-xr-x 1 perrizo faculty 1.3K Nov 29 11:22 mpp-cluster-list
-rw-r--r-- 1 perrizo faculty 2.5K Nov 29 11:22 mpp-cluster-list.pl
-rw-r--r-- 1 perrizo faculty 1.7K Nov 29 11:22 mppConfig.C
-rw-r--r-- 1 perrizo faculty 1.1K Nov 29 11:22 mppConfig.H
-rw-r--r-- 1 perrizo faculty 2.9K Nov 29 11:25 mppConfig.o
-rwxr-xr-x 1 perrizo faculty 745 Dec 5 11:32 mpp-filter
-rw-r--r-- 1 perrizo faculty 3.0K Dec 5 11:32 mpp-filter.pl
-rwxr-xr-x 1 perrizo faculty 2.3K Nov 29 11:22 mpp-glue
-rw-r--r-- 1 perrizo faculty 591 Nov 29 11:22 mpp.h
-rwxr-xr-x 1 perrizo faculty 101K Feb 2 12:38 mpp-mpred
-rw-r--r-- 1 perrizo faculty 13K Nov 29 11:22 mpp-mpred.C
-rw-r--r-- 1 perrizo faculty 29K Nov 29 11:25 mpp-mpred.o
-rwxr-xr-x 1 perrizo faculty 1.4K Nov 29 11:22 mpp-rmse
-rw-r--r-- 1 perrizo faculty 1.5K Nov 29 11:22 mpp-rmse.pl
-rw-r--r-- 1 perrizo faculty 6.9K Jan 19 06:36 mpp-user.C
-rwxr-xr-x 1 perrizo faculty 1.3K Nov 29 11:22 mpp-user-cluster
-rw-r--r-- 1 perrizo faculty 3.8K Nov 29 11:22 mpp-user-cluster.pl
-rw-r--r-- 1 perrizo faculty 11K Jan 19 07:07 mpp-user.o
-rwxr-xr-x 1 perrizo faculty 2.5K Jan 21 17:38 mpp-user-reduce
-rw-r--r-- 1 perrizo faculty 3.2K Jan 21 17:38 mpp-user-reduce.pl
drwxr-xr-x 75 perrizo faculty 2.3M Feb 2 13:13 Output
drwxr-xr-x 3 perrizo faculty 4.0K Jan 8 13:28 p19
drwxr-xr-x 5 perrizo faculty 4.0K Jan 8 13:34 p95
drwxr-xr-x 5 perrizo faculty 4.0K Jan 31 10:51 pf
-rw-r--r-- 1 perrizo faculty 22K Nov 29 11:22 PredictionConfig.C
-rw-r--r-- 1 perrizo faculty 5.3K Nov 29 11:22 PredictionConfig.H
-rw-r--r-- 1 perrizo faculty 22K Nov 29 11:25 PredictionConfig.o
-rw-r--r-- 1 perrizo faculty 19K Feb 2 12:38 prune.C
-rw-r--r-- 1 perrizo faculty 29K Feb 2 12:38 prune.o
-rw-r--r-- 1 perrizo faculty 1.2K Nov 29 11:22 read-user-ptrees.C
-rwxr-xr-x 1 perrizo faculty 146 Nov 29 13:59 run
-rwxr-xr-x 1 perrizo faculty 74K Dec 16 20:47 show-config
-rw-r--r-- 1 perrizo faculty 454 Nov 29 11:22 show-config.C
-rw-r--r-- 1 perrizo faculty 2.7K Nov 29 11:25 show-config.o
-rw-r--r-- 1 perrizo faculty 6.4K Nov 29 11:22 UserSet.C
-rw-r--r-- 1 perrizo faculty 1.5K Nov 29 11:22 UserSet.H
-rw-r--r-- 1 perrizo faculty 7.2K Nov 29 11:25 UserSet.o
-rw-r--r-- 1 perrizo faculty 17K Jan 19 06:57 user-vote.C
-rw-r--r-- 1 perrizo faculty 9.3K Jan 19 07:07 user-vote.o

$ ls -l Data
-rw-r--r-- 1 perrizo faculty 67 Dec 18 01:32 p1.txt
-rw-r--r-- 1 perrizo faculty 23 Dec 18 01:32 p1.txt.answers
-rw-r--r-- 1 perrizo faculty 533K Dec 18 01:33 probe-1000.txt
-rw-r--r-- 1 perrizo faculty 146K Dec 18 01:33 probe-1000.txt.answers
-rw-r--r-- 1 perrizo faculty 1.9K Dec 18 01:32 probe19.txt
-rw-r--r-- 1 perrizo faculty 611 Dec 18 01:32 probe19.txt.answers
-rw-r--r-- 1 perrizo faculty 23K Dec 18 01:32 probe95.txt
-rw-r--r-- 1 perrizo faculty 6.4K Dec 18 01:32 probe95.txt.answers
-rw-r--r-- 1 perrizo faculty 594K Dec 18 01:32 test-probe-1000.txt
-rw-r--r-- 1 perrizo faculty 162K Dec 18 01:32 test-probe-1000.txt.answers
-rw-r--r-- 1 perrizo faculty 51K Dec 18 01:32 test-probe-100.txt
-rw-r--r-- 1 perrizo faculty 14K Dec 18 01:32 test-probe-100.txt.answers

$ ls -l libPTree
-rw-r--r-- 1 perrizo faculty 18672 Nov 29 11:25 libPTree.a
-rw-r--r-- 1 perrizo faculty 3192 Nov 29 11:22 Makefile
-rw-r--r-- 1 perrizo faculty 15813 Nov 29 11:22 PTree.C
-rw-r--r-- 1 perrizo faculty 2973 Nov 29 11:22 PTree.H
-rw-r--r-- 1 perrizo faculty 11096 Nov 29 11:25 PTree.o
-rw-r--r-- 1 perrizo faculty 18135 Nov 29 11:22 PTree-omp.C
-rw-r--r-- 1 perrizo faculty 3796 Nov 29 11:22 ptree-op-test.C
-rw-r--r-- 1 perrizo faculty 488 Nov 29 11:22 ptree-read.C
-rw-r--r-- 1 perrizo faculty 779 Nov 29 11:22 ptree-save.C
-rw-r--r-- 1 perrizo faculty 7485 Nov 29 11:22 PTreeSet.C
-rw-r--r-- 1 perrizo faculty 1179 Nov 29 11:22 PTreeSet.H
-rw-r--r-- 1 perrizo faculty 6464 Nov 29 11:25 PTreeSet.o
-rw-r--r-- 1 perrizo faculty 2265 Nov 29 11:22 ptreeset-read.C
-rw-r--r-- 1 perrizo faculty 420 Nov 29 11:22 ptree-test.C
-rw-r--r-- 1 perrizo faculty 16127 Nov 29 11:22 PTree-x86_64.C
-rw-r--r-- 1 perrizo faculty 16127 Nov 29 11:22 PTree-x86.C

$ ls -l Output
...
-rw-r--r-- 1 perrizo faculty 32157 Feb 2 13:25 probe-full.txt_9939.predict
...
drwxr-xr-x 2 perrizo faculty 901120 Jan 20 06:23 probe-full.txt.backup
-rw-r--r-- 1 perrizo faculty 7059980 Jan 20 06:23 probe-full.txt.predictions


Makefile

VERSION = 2.6.0

# Default directory where PTree data is stored.
# Overridden below depending on architecture.
PTREEDATA = /tmp

# Set compiler behavior based on architecture.
ARCH := $(shell uname -m | sed -e s/i686/x86/)

ifeq (${ARCH}, x86_64)
    COMPILER = gcc
    # COMPILER = gcc4
    PTREEDATA = /scratch/perrizo
endif

ifeq (${ARCH}, ia64)
    # COMPILER = intel
    COMPILER = gcc4
endif

ifeq (${ARCH}, x86)
    COMPILER = gcc
endif

ifneq (${COMPILER},)
  ifeq (${COMPILER}, gcc4)
    CC = /opt/gcc4/bin/gcc
    C++ = /opt/gcc4/bin/g++

    # WARNINGS = -W -Wall -Wchar-subscripts -Wshadow \
    #            -Wpointer-arith -Wwrite-strings -Wmissing-prototypes
    # VECTOR = -ftree-vectorize -ftree-vectorizer-verbose=5

    OPT = -O2 ${VECTOR}
    ifeq (${ARCH}, x86_64)
      OPT += -msse2
    endif

    C_DEBUG = -g -pg
    LD_DEBUG = -g -pg
  endif

  ifeq (${COMPILER}, gcc)
    CC = gcc
    C++ = g++
    # WARNINGS = -W -Wall -Wchar-subscripts -Wshadow \
    #            -Wpointer-arith -Wwrite-strings -Wmissing-prototypes
    # VECTOR = -ftree-vectorize -ftree-vectorizer-verbose=5

    OPT = -O2 ${VECTOR}
    ifeq (${ARCH}, x86_64)
      OPT += -msse2
    endif
    C_DEBUG = -g -pg
    LD_DEBUG = -g -pg
  endif

  ifeq (${COMPILER}, pgroup)
    CC = pgcc
    C++ = pgCC
    OPT = -fast -Minline=levels:10
    C_DEBUG = -g -Minfo #-pg
    LD_DEBUG = -g -tp core2-64 #-pg
  endif

  ifeq (${COMPILER}, intel)
    CC = icpc
    C++ = icpc
    OPT = -O2
    C_DEBUG = -g -p
    LD_DEBUG = -g -p
  endif
endif

INCLUDES = -I./libPTree
CFLAGS = ${OPT} ${WARNINGS} ${INCLUDES}
ifdef DEBUG
  CFLAGS += ${C_DEBUG}
endif
ifdef DEBUG
  LDFLAGS += ${LD_DEBUG}
endif

OBJS = mpp-mpred.o mpp-user.o mppConfig.o PredictionConfig.o \
       MovieCorrelation.o UserSet.o MovieSet.o movie-vote.o user-vote.o \
       prune.o config.o config-parser.o
LIB = ./libPTree/libPTree.a
LIBS = -lfl -L ./libPTree -lPTree

# Executable target definitions.
all: mpp-mpred show-config movie-corr

mpp-mpred: ${OBJS} ${LIB}
	${C++} ${LDFLAGS} -o $@ $^ ${LIBS};

cosupport: cosupport.o UserSet.o MovieSet.o ${LIB}
	${C++} ${LDFLAGS} -o $@ $^ ${LIBS};

tools: movie-rating movie-set

movie-rating: movie-rating.o UserSet.o MovieSet.o ${LIB}
	${C++} ${LDFLAGS} -o $@ $^ ${LIBS}

movie-set: movie-set.o UserSet.o ${LIB}
	${C++} ${LDFLAGS} -o $@ $^ ${LIBS};

movie-corr: movie-corr.o MovieCorrelation.o ${LIB}

Page 41: The $1,000,000 Netflix Contest

cosupport.C
/**
 * This file contains a driver program to determine the rating given to
 * a movie by a user.
 */

/* Standard include files. */
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

/* Local include files. */
#include "UserSet.H"
#include "MovieSet.H"

extern int main(int argc, char *argv[])

{ auto MovieSet Movies;

auto UserSet Users;

    fputs("Loading movie PTree's.\n", stdout);
    if ( !Movies.load_binary() ) {
        fputs("Cannot load binary PTree's.\n", stderr);
        return 1;
    }
    fputs("Loading user PTree's.\n", stdout);
    if ( !Users.load_binary() ) {
        fputs("Cannot load binary PTree's.\n", stderr);
        return 1;
    }

fputs("Loading user identities.\n\n", stdout); if ( !Users.load_identities() ) { fputs("Cannot load user identities.\n", stderr); return 1; }

if ( argv[1] == NULL ) { fputs("Need V specified.\n", stderr); return 1; } auto unsigned long int U = 421582, M = 0, V = strtoul(argv[1], NULL, 10); auto PTree M_support = Movies.get_users(M); auto PTree Voters(M_support); Voters.clearbit(U); unsigned long long int *voters = Voters.get_indexes();

fputs("Voter list:\n", stdout); for (size_t voter= 0; voter < Voters.get_count(); ++voter) fprintf(stdout, "%zu: %llu\n", voter, voters[voter]); fputc('\n', stdout); auto PTree cosupport; fputs("Voter Map:\n", stdout); Voters.dump(stdout);

fputs("U Map:\n", stdout); (Users.get_movies(U)).dump(stdout); fputs("V Map:\n", stdout); (Users.get_movies(V)).dump(stdout); cosupport = Users.get_movies(U) & Users.get_movies(V); fputs("Cosupport Map:\n", stdout); cosupport.dump(stdout); cosupport.clearbit(M);

fprintf(stdout, "Cosupport, M= %lu, U = %lu, V = %lu\n", M,U,V); auto double Ubar = Users.get_mean(U, cosupport),

Vbar = Users.get_mean(V, cosupport), Vrt = Users.get_rating(V, M);

    auto double vote = Vrt - Vbar + Ubar;
    auto unsigned long long int *movies = cosupport.get_indexes();

    /* List every movie in the cosupport with the ratings given by U and V. */
    for (unsigned long int movie= 0; movie < cosupport.get_count(); ++movie)
        fprintf(stdout, "\t\t\t\t%lu [%llu]:\tU = %0.2f, V = %0.2f\n",
                Movies.get_identity(movies[movie]), movies[movie],
                Movies.get_rating(U, movies[movie]),
                Movies.get_rating(V, movies[movie]));

    fprintf(stdout, "\t\t\t%.2f\t[Vrt: %.2f Vbar: %.2f Ubar: %.2f]\n",
            vote, Vrt, Vbar, Ubar);

return 0;}
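The vote computed in cosupport.C, vote = Vrt - Vbar + Ubar, shifts voter V's rating of movie M by the difference between the two users' mean ratings over the movies they both rated. The following is a minimal standalone sketch of that arithmetic only. It assumes the co-rated ratings are already available as plain vectors; the names mean_over and offset_vote are hypothetical and are not part of the PTree code, where Users.get_mean() and Users.get_rating() supply these values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mean of a non-empty list of ratings. Hypothetical stand-in for
// Users.get_mean(user, cosupport) in the listing above.
double mean_over(const std::vector<double> &ratings)
{
    assert(!ratings.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < ratings.size(); ++i)
        sum += ratings[i];
    return sum / static_cast<double>(ratings.size());
}

// Vote cast by voter V for user U's rating of movie M:
//     vote = Vrt - Vbar + Ubar
// u_cosupport and v_cosupport hold the ratings U and V gave to the
// same co-rated movies (M itself excluded), in the same order.
double offset_vote(const std::vector<double> &u_cosupport,
                   const std::vector<double> &v_cosupport,
                   double v_rating_of_m)
{
    assert(u_cosupport.size() == v_cosupport.size());
    const double ubar = mean_over(u_cosupport);
    const double vbar = mean_over(v_cosupport);
    return v_rating_of_m - vbar + ubar;
}
```

For example, if U rated three shared movies 4, 5, 3 and V rated the same movies 2, 3, 1, then Ubar = 4 and Vbar = 2, so a rating of 3 by V for M becomes a vote of 3 - 2 + 4 = 5 for U.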

Page 42: The $1,000,000 Netflix Contest

movie-corr.C/** \file This file implements a program for * printing movie-movie correlations.*//* Standard include files. */#include <stdio.h>#include <stdlib.h>#include <unistd.h>

/* Local include files. */#include "MovieCorrelation.H"

/* Program entry point. */extern int main(int argc, char *argv[]){ auto bool dump = false; auto int gopt; auto unsigned int target=0, movie=0; auto MovieCorrelation mvcorr; while ((gopt=getopt(argc,argv,"dm:t:"))!=EOF){ switch ( gopt ) { case 'd': dump = true; break; case 'm': movie = atoi(optarg); break; case 't': target = atoi(optarg); break; } }

if ( movie == 0 ) { fputs("movie-corr: No movie specified.\n", stderr); return 1; } if ( !mvcorr.load(movie) ) { printf("Error loading movies.\n"); return 1; }

    /* Dump movies and correlations (correlation / support). */
    if ( dump ) {
        fprintf(stdout, "Correlations for movie: %u\n", movie);
        for (unsigned int lp= 0; lp < MOVIE_COUNT; ++lp)
            fprintf(stdout, "\t%5u: %7.4f / %d\n", lp + 1,
                    mvcorr.corr(lp), mvcorr.supp(lp));
        return 0;
    }

    /* Print correlation of target movie. */
    if ( target > 0 ) {
        fprintf(stdout, "%-7.4f\n", mvcorr.corr(target - 1));
        return 0;
    }

    return 0;
}

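mpp
#! /bin/bash
if [ "$1" != "-i" ]; then
    echo "No input file specified."
    exit 1
fi

# Consume "-i" and the input file name so that only the remaining
# arguments are forwarded to mpp-mpred.
shift
inputfile="$1"
shift

run_name=`basename "$inputfile"`
rm -f "$run_name.out"
./mpp-mpred -i "$inputfile" "$@" >"$run_name.out" 2>&1 &

# Wait for the output file to appear, then follow it.
while [ ! -e "$run_name.out" ]; do
    sleep 1s
done
tail -f "$run_name.out"
exit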

mpp.h/** \file * This file contains general definitions and * defines for the PTree * based Netflix prediction system. */

/* External variable declarations. */extern UserSet Users;extern MovieSet Movies;

/* Function declarations. */extern void do_pruning(struct external_prune * const prune, unsigned long int M, unsigned long int U, \ PTree & supM, PTree & supU);

double user_vote(PredictionConfig *, unsigned long int, PTree &, unsigned long int, PTree &);

double movie_vote(PredictionConfig *, unsigned long int, PTree &, unsigned long int, PTree &);

Page 43: The $1,000,000 Netflix Contest

MovieCorrelation.C/** \file * This file contains the implementation of a class * which encapsulates management of correlation info * for a particular movie to all other movies. */

/* System include files. */#include <stdlib.h>

/* Standard C++ includes. */#include <string>#include <iostream>#include <fstream>

/* Local include files. */#include "MovieCorrelation.H"

using namespace std;

MovieCorrelation::MovieCorrelation(void)

{ movie_index = 0;

    /* Initialize correlation and support count.
     * Both arrays hold MOVIE_COUNT + 1 entries (indexes 0..MOVIE_COUNT). */
    for (unsigned int lp= 0; lp < MOVIE_COUNT + 1; ++lp) {
        support[lp] = 0;
        correlations[lp] = 0.0;
    }

return;}

/** * Destructor. */

MovieCorrelation::~MovieCorrelation(void)

{return;}

/*Public method. * Implements loading of correlation and support vector for given movie. * \param index The index number of the movie to be loaded. * \return A boolean value is used to indicate the success * or failure of the load. A true value indicates success.*/bool MovieCorrelation::load(unsigned long int index){ auto char snbufr[10]; auto string root = PTREEDATA"/mpred-data/", corr_path = root + "mv_corr/co_mv_", supp_path = root + "mv_supp/sp_mv_"; auto ifstream corr_file, supp_file; /* Sanity check for movie index size. */ if ( index > (MOVIE_COUNT + 1) ) return false; movie_index = index; /* Synthesize the filename of the correlations file and read it. */ snprintf(snbufr, sizeof(snbufr), "%lu", movie_index); string sn(snbufr); string corr_fname = corr_path + sn + ".bin"; corr_file.open(corr_fname.c_str()); if ( corr_file.fail() ) { corr_file.close(); return false; } corr_file.read(reinterpret_cast<char*>(&correlations), \ (MOVIE_COUNT + 1)*sizeof(float)); if ( corr_file.fail() ) { corr_file.close(); return false; } corr_file.close(); /* Synthesize the filename of the support file and read it. */ string supp_fname = supp_path + sn + ".bin"; supp_file.open(supp_fname.c_str()); if ( supp_file.fail() ) { supp_file.close(); return false; } supp_file.read(reinterpret_cast<char*>(&support), \ (MOVIE_COUNT + 1)*sizeof(short int)); if ( supp_file.fail() ) { supp_file.close(); return false; } supp_file.close(); return true;}

Page 44: The $1,000,000 Netflix Contest

MovieCorrelation.H#if !defined(MOVIECORRELATION_H)#define MOVIECORRELATION_H

/* Total number of movies. */#define MOVIE_COUNT 17770

/* Standard include files. */#include <stdio.h>

/* Local include files. */

class MovieCorrelation

{private: /* The index number of the movie whose correlations are loaded. */ unsigned long int movie_index;

/* * The following array contains the list of correlations for * a movie to all the other movies. The array is one based * so a value of one needs to be added to the movie index * number to retrieve the correlation. */ float correlations[MOVIE_COUNT + 1];

/* * The following array contains the support list for the * correlations vector. The vector is one based as is the * correlations vector. */ unsigned short int support[MOVIE_COUNT + 1];

public: /* Void constructor. */ MovieCorrelation(void);

/* Destructor. */ ~MovieCorrelation(void);

    /*
     * Inline accessor methods for returning movie supports and
     * correlations. The index argument is zero based; an out of range
     * index returns 0.
     */
    float inline corr(unsigned int index)
    {
        if ( index >= MOVIE_COUNT ) return 0;
        return correlations[index + 1];
    }

    unsigned short int inline supp(unsigned int index)
    {
        if ( index >= MOVIE_COUNT ) return 0;
        return support[index + 1];
    }

    /* Public method for loading the correlation vector for a movie. */
    bool load(unsigned long int);
};
#endif
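The header stores correlations in a one-based array of MOVIE_COUNT + 1 entries while corr() and supp() take a zero-based index, so the largest valid index is MOVIE_COUNT - 1. The sketch below illustrates that convention in isolation. The class name OneBasedTable and its methods are hypothetical and chosen only for this example; it is not the MovieCorrelation class itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Storage is one-based (slot 0 unused), lookups are zero-based.
class OneBasedTable {
public:
    explicit OneBasedTable(std::size_t count) : values_(count + 1, 0.0f) {}

    // Number of usable entries (the unused slot 0 is not counted).
    std::size_t count() const { return values_.size() - 1; }

    // Store a value under its one-based identity (1..count).
    void set_by_identity(std::size_t identity, float value)
    {
        assert(identity >= 1 && identity <= count());
        values_[identity] = value;
    }

    // Fetch by zero-based index (0..count-1); out of range yields 0.
    float at_index(std::size_t index) const
    {
        if (index >= count()) return 0.0f;
        return values_[index + 1];
    }

private:
    std::vector<float> values_;
};
```

With a table of three entries, identity 1 is reached through index 0 and identity 3 through index 2; index 3 is already out of range and returns 0 instead of reading past the array.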

Page 45: The $1,000,000 Netflix Contest

MovieSet.C
/* System include files. */
#include <limits.h>

/* Local include files. */
#include "MovieSet.H"

/* Variables static to this module. */

/* No argument constructor. */
MovieSet::MovieSet(void) : ptree_set()
{
    return;
}

/* Destructor. */
MovieSet::~MovieSet(void)
{
    return;
}

/* Public method calculates the rating a user provided for a movie.
 * \param user_index  The index number of the user.
 * \param movie_index The index number of the movie.
 * \return The rating number is returned to the caller. */
double MovieSet::get_rating(unsigned long int user_index,
                            unsigned long int movie_index)
{
    auto double rating = 0;
    auto size_t slot = movie_index * 3;

    /* Each movie owns three PTree's; slot + 2 is the least significant bit. */
    for (int tree= 2, bit= 0; tree >= 0; --tree, ++bit) {
        if ( ptree_set[slot + tree].is_set(user_index) )
            rating += pow(2.0, bit);
    }
    return rating;
}

/* Public method returns the PTree describing the
 * set of users who rated the movie. */
PTree MovieSet::get_users(unsigned long int index)
{
    auto size_t slot = index * 3;
    return ptree_set[slot] | ptree_set[slot+1] | ptree_set[slot+2];
}

/* Public method dumps the PTree set.
 * \param output The file descriptor to which the PTree's are to be directed. */
void MovieSet::dump(FILE *output)
{
    for (int lp= 0; lp < ptree_set.size(); ++lp)
        ptree_set[lp].dump(output);
    return;
}

/* Public method loads the binary PTree set which has as its
 * X-axis user indexes with movie rating PTree's on the Y-axis. */
bool MovieSet::load_binary(void)
{
    auto char bufr[PATH_MAX];
    auto FILE *input;

    for (int pt= 22; pt <= 53331; ++pt) {
        snprintf(bufr, sizeof(bufr),
                 "%s/mpred-data/nf_us_mv_pt/p%d.pct", PTREEDATA, pt);
        if ( (input = fopen(bufr, "r")) == NULL ) return false;
        if ( !ptree_set.load_binary_file(input) ) {
            /* Do not leak the file handle on a failed load. */
            fclose(input);
            return false;
        }
        fclose(input);
    }
    return true;
}
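MovieSet::get_rating() and get_users() treat each movie as three bit slices over the user axis: the rating is rebuilt from one bit per slice, and a user has rated the movie when any of the three bits is set (a rating of 0 means "did not rate"). The sketch below shows the same decoding with std::vector<bool> standing in for a PTree. The names BitSlice, decode_rating and has_rated are hypothetical; real PTrees are compressed structures, which this example does not attempt to model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One uncompressed bit per user; an assumed stand-in for a PTree.
typedef std::vector<bool> BitSlice;

// Rebuild a rating from three bit slices of one movie.
// slices[0] is the most significant bit and slices[2] the least,
// matching the slot, slot + 1, slot + 2 layout in get_rating().
unsigned decode_rating(const std::vector<BitSlice> &slices, std::size_t user)
{
    assert(slices.size() == 3);
    unsigned rating = 0;
    for (std::size_t s = 0; s < 3; ++s) {
        assert(user < slices[s].size());
        rating = (rating << 1) | (slices[s][user] ? 1u : 0u);
    }
    return rating;
}

// A user belongs to the movie's support when any slice bit is set,
// which is the OR of the three slices used by get_users().
bool has_rated(const std::vector<BitSlice> &slices, std::size_t user)
{
    return decode_rating(slices, user) != 0;
}
```

For three users with slices 100, 010 and 110 (one character per user), the first user's bits are 1, 0, 1, giving a rating of 5; the second user's bits are 0, 1, 1, giving 3; the third user has no bit set and is therefore outside the movie's support.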

MovieSet.H#if !defined(MOVIESET_H)#define MOVIESET_H/* Standard include files. */#include <stdio.h>#include <math.h>/* Local include files. */#include "PTreeSet.H"class MovieSet{private: PTreeSet ptree_set;public: /* Void constructor. */ MovieSet(void);

/* Constructor to initialize an in-memory tree. */ /* Destructor. */ ~MovieSet(void);

/* Public inline method to return identity of movie index*/ unsigned long int get_identity(unsigned long int offset) { return offset + 1; }

/* Public inline method to return index of movie identity*/ unsigned long int get_index(unsigned long int identity) { return identity - 1; }

/* Public method to return rating of movie by user. */ double get_rating(unsigned long int, unsigned long int);

/* Public method to return set of users rating movie. */ PTree get_users(unsigned long int);

/* Public method to print sparseness of set. */ void dump(FILE *);

/* Public method to load a binary PTree set. */ bool load_binary(void);};#endif

Page 46: The $1,000,000 Netflix Contest

mppConfig.C
/** \file
 * This file contains the implementation of a class which encapsulates
 * the information needed to configure a prediction run. The purpose of
 * the class is to abstract out the difference between a single
 * configuration run and a run based on a cluster of configurations.
 */

/* System include files. */

/* Local include files. */
#include "mppConfig.H"

/* No argument constructor. */
mppConfig::mppConfig(void)
{
    standard_config = false;
    standard = NULL;
    cluster_config = false;
    return;
}

/* Destructor. */
mppConfig::~mppConfig(void)
{
    if ( standard != NULL ) delete standard;
    return;
}

/* Public method causes the object to be initialized
 * as a standard single file configuration.
 * \param cfgfile Pointer to buffer containing the name of the
 *                standard configuration file.
 * \return If initialization of the configuration is successful
 *         a boolean true value is returned. Otherwise a
 *         false value is returned. */
bool mppConfig::read_config(const char * const cfgfile)
{
    standard = new PredictionConfig;
    if ( standard == NULL ) return false;
    if ( !standard->read_config(cfgfile) ) return false;

    standard_config = true;
    return true;
}

/* Public method intended to initialize the object as a cluster
 * configuration. It is currently a stub and always returns false.
 * \param cfgfile Pointer to buffer containing the name of the
 *                cluster configuration file.
 * \return If initialization of the configuration is successful
 *         a boolean true value is returned. Otherwise a
 *         false value is returned. */
bool mppConfig::read_cluster_config(const char * const cfgfile)
{
    return false;
}

mppConfig.H#if !defined(MPPCONFIG_H)#define MPPCONFIG_H/* Standard include files. */#include <stdio.h>/* Local include files. */#include "PredictionConfig.H"class mppConfig{private: bool standard_config, cluster_config; PredictionConfig *standard;public: /* Void constructor. */ mppConfig(void);

/* Destructor. */ ~mppConfig(void);/* Public inline accessor methods to determine if a standard* or cluster configuration is being used. */inline bool is_standard_config(void) {return standard_config;}inline bool is_cluster_config(void) {return cluster_config;}/* Public inline accesor method for the standard configuration. */inline PredictionConfig*get_standard_config(void){return standard;}/* Public method to read a configuration file. */bool read_config(const char * const);/* Public method to read a cluster configuration file. */bool read_cluster_config(const char * const);/* Public method to print out a configuration. */void print(FILE *);};#endif

Page 47: The $1,000,000 Netflix Contest

PredictionConfig.C/* \file File contains implementation of class which encapsulates * info which regulates how Movie/User pair predictions are made.*//* System include files. */#include <stdlib.h>#include <string.h>/* Local include files. */#include "PredictionConfig.H"extern "C" {#include "config.h"}/* Internal private function. * This function initializes an internal pruning structure. * \param p A pointer to the structure to be initialized. */static void _init_internal_prune(struct pruning *p){ p->enabled = false; p->weight = false; p->threshold = 0.0; p->exponent = 1.0; return;}/* Internal private function. * This function initializes a structure defining external pruning. * \param p A pointer to the structure to be initialized. */

static void _init_external_prune(struct external_prune *p){ p->enabled = false; p->method = UserPrune; p->params.mstrt = 0; p->params.mstrt_mult = 0.0; p->params.ustrt = 0; p->params.ustrt_mult = 0.0; p->params.TSa = -100; p->params.TSb = -100; p->params.Tdvp = -1; p->params.Tdvs = -1; p->params.Tvdp = -1; p->params.Tvds = -1; p->params.TD = -1; p->params.TP = -1; p->params.PPm = 0.1; p->params.TV = -1; p->params.TSD = -1; p->params.Ch = 1; p->params.Ct = 2; return;}

/* No argument constructor. */
PredictionConfig::PredictionConfig(void)
{
    /* Initialize general prediction parameters. */
    name = NULL;
    user_voting = false;
    movie_voting = false;
    user_vote_weight = 1;

    /* Initialize user voting parameters. */
    user_force_vote_in_Voter_Loop = false;
    user_force_vote_after_Voter_Loop = false;
    user_reset_support = false;
    user_boundary_override = false;
    user_facz = 0.0;
    user_thrz = 1.0;
    _init_internal_prune(&dvCorp);
    _init_internal_prune(&dvCors);
    _init_internal_prune(&vdCorp);
    _init_internal_prune(&vdCors);
    _init_internal_prune(&pCor);
    _init_internal_prune(&dCor);
    _init_internal_prune(&sCor);
    _init_internal_prune(&dUVsdp);
    _init_internal_prune(&dUVsds);
    _init_internal_prune(&Vsdp_Usdp);
    _init_internal_prune(&Vsds_Usds);
    _init_external_prune(&Prune_Users_in_SupM);
    _init_external_prune(&Prune_Movies_in_SupU);
    _init_external_prune(&Prune_Movies_in_CoSupUV);

    /* Initialize movie voting parameters. */
    movie_force_vote_in_Voter_Loop = false;
    movie_force_vote_outside_Voter_Loop = false;
    movie_reset_support = false;
    movie_boundary_override = false;
    movie_facz = 0.0;
    movie_thrz = 1.0;
    _init_internal_prune(&DVCorp);
    _init_internal_prune(&DVCors);
    _init_internal_prune(&VDCorp);
    _init_internal_prune(&VDCors);
    _init_internal_prune(&PCor);
    _init_internal_prune(&DCor);
    _init_internal_prune(&SCor);
    _init_internal_prune(&dMNsdp);
    _init_internal_prune(&dMNsds);
    _init_internal_prune(&Nsdp_Msdp);
    _init_internal_prune(&Nsds_Msds);
    _init_external_prune(&Movie_Prune_Users_in_SupM);
    _init_external_prune(&Movie_Prune_Movies_in_SupU);
    _init_external_prune(&Movie_Prune_Users_in_CoSupMN);
    return;
}

Page 48: The $1,000,000 Netflix Contest

PredictionConfig.C page 2/* Destructor. */PredictionConfig::~PredictionConfig(void){ if ( name != NULL ) free(name); return; }

/* Internal private function.
 * Determines whether a configuration option is enabled.
 * \param cf  Pointer to the configuration to be tested for the option.
 * \param var Pointer to the name of the variable to be tested.
 * \return A boolean value is returned to indicate whether the
 *         configuration option has been enabled. A true value
 *         indicates the variable is enabled, otherwise false
 *         is returned. */
static bool _is_enabled(Config cf, const char * const var)
{
    auto char *p;

    p = Config_Get(cf, var);
    if ( p == NULL ) return false;
    if ( strcmp(p, "enabled") == 0 ) return true;
    return false;
}

/* Internal private function.
 * Initializes the configuration structure for an internal pruning method.
 * \param cf        The configuration which is being used.
 * \param sp        Pointer to the structure to be initialized.
 * \param name      Name of the internal pruning method.
 * \param threshold Name of the variable containing the threshold.
 * \param weight    Name of the variable specifying whether the
 *                  method should be used to set the value of uCor. */
void _set_internal_prune(Config cf, struct pruning *sp, const char *name,
                         const char *threshold, const char *weight)
{
    auto char *val;

    sp->enabled = _is_enabled(cf, name);
    if ( !sp->enabled ) return;
    val = Config_Get(cf, threshold);
    if ( val != NULL ) sp->threshold = atof(val);
    sp->weight = _is_enabled(cf, weight);
    return;
}

/* Internal private function.
 * Initializes the configuration structure for a standard
 * deviation based pruning method.
 * \param cf        Configuration which is being used.
 * \param sp        Pointer to the structure to be initialized.
 * \param name      Name of the internal pruning method.
 * \param threshold Name of the variable containing the threshold value.
 * \param exponent  Name of the variable specifying the exponent
 *                  which should be used for the GAUSSIAN. */
void _set_stddev_prune(Config cf, struct pruning *sp, const char *name,
                       const char *threshold, const char *exponent)
{
    auto char *val;

    sp->enabled = _is_enabled(cf, name);
    if ( !sp->enabled ) return;
    val = Config_Get(cf, threshold);
    if ( val != NULL ) sp->threshold = atof(val);
    val = Config_Get(cf, exponent);
    if ( val != NULL ) sp->exponent = atof(val);
    return;
}

/* Internal private function.
 * Initializes the configuration structure for an external pruning method.
 * \param cf   The configuration which is being used.
 * \param sp   A pointer to the external pruning definition
 *             structure which is to be initialized.
 * \param name The name of the external pruning method. */
void _set_external_prune(Config cf, struct external_prune *sp,
                         const char *name)
{
    auto char *val;
    auto struct pruning_parameters *pp = &sp->params;

    if ( !Config_Set_Section(cf, name) ) return;

    /* Keep the default method when none is configured. */
    val = Config_Get(cf, "method");
    if ( val != NULL ) {
        if ( strcmp(val, "UserPrune") == 0 ) sp->method = UserPrune;
        if ( strcmp(val, "UserFastPrune") == 0 ) sp->method = UserFastPrune;
        if ( strcmp(val, "UserCommonCoSupportPrune") == 0 )
            sp->method = UserCommonCoSupportPrune;
        if ( strcmp(val, "MoviePrune") == 0 ) sp->method = MoviePrune;
        if ( strcmp(val, "MovieFastPrune") == 0 ) sp->method = MovieFastPrune;
        if ( strcmp(val, "MovieCommonCoSupportPrune") == 0 )
            sp->method = MovieCommonCoSupportPrune;
    }

    /* Set the external pruning parameters. */
    val = Config_Get(cf, "mstrt");
    if ( val != NULL ) pp->mstrt = atoll(val);
    val = Config_Get(cf, "mstrt_mult");
    if ( val != NULL ) pp->mstrt_mult = atof(val);
    val = Config_Get(cf, "ustrt");
    if ( val != NULL ) pp->ustrt = atoll(val);
    val = Config_Get(cf, "ustrt_mult");
    if ( val != NULL ) pp->ustrt_mult = atof(val);
    val = Config_Get(cf, "TSa");
    if ( val != NULL ) pp->TSa = atof(val);
    val = Config_Get(cf, "TSb");
    if ( val != NULL ) pp->TSb = atof(val);
    val = Config_Get(cf, "Tdvp");
    if ( val != NULL ) pp->Tdvp = atof(val);
    val = Config_Get(cf, "Tdvs");
    if ( val != NULL ) pp->Tdvs = atof(val);
    val = Config_Get(cf, "Tvdp");
    if ( val != NULL ) pp->Tvdp = atof(val);
    val = Config_Get(cf, "Tvds");
    if ( val != NULL ) pp->Tvds = atof(val);
    val = Config_Get(cf, "TD");
    if ( val != NULL ) pp->TD = atof(val);
    val = Config_Get(cf, "TP");
    if ( val != NULL ) pp->TP = atof(val);
    val = Config_Get(cf, "PPm");
    if ( val != NULL ) pp->PPm = atof(val);
    val = Config_Get(cf, "TV");
    if ( val != NULL ) pp->TV = atof(val);
    val = Config_Get(cf, "TSD");
    if ( val != NULL ) pp->TSD = atof(val);
    val = Config_Get(cf, "Ch");
    if ( val != NULL ) pp->Ch = atof(val);
    val = Config_Get(cf, "Ct");
    if ( val != NULL ) pp->Ct = atof(val);
    return;
}
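The configuration helpers above all follow one pattern: a method is active only when its key holds the string "enabled", and each numeric parameter keeps its initialized default unless the key is present. The sketch below reproduces that pattern with a std::map standing in for one configuration section. The Section typedef, the Pruning struct and the function names are assumptions made for this example; the real code goes through Config_Get() and Config_Set_Section(), whose implementation is not shown in these slides.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <string>

// Assumed stand-in for one parsed configuration section.
typedef std::map<std::string, std::string> Section;

// Mirrors the fields of struct pruning in PredictionConfig.H.
struct Pruning {
    bool enabled;
    bool weight;
    double threshold;
    double exponent;
};

// True only when the key exists and holds the string "enabled".
bool is_enabled(const Section &cf, const std::string &key)
{
    assert(!key.empty());
    Section::const_iterator it = cf.find(key);
    return it != cf.end() && it->second == "enabled";
}

// Numeric lookup that keeps the fallback when the key is absent.
double get_double(const Section &cf, const std::string &key, double fallback)
{
    Section::const_iterator it = cf.find(key);
    if (it == cf.end()) return fallback;
    return std::atof(it->second.c_str());
}

// Same flow as _set_stddev_prune(): start from the defaults, then
// read threshold and exponent only when the method is enabled.
Pruning read_stddev_prune(const Section &cf, const std::string &name,
                          const std::string &threshold_key,
                          const std::string &exponent_key)
{
    Pruning p = { false, false, 0.0, 1.0 };
    p.enabled = is_enabled(cf, name);
    if (!p.enabled) return p;
    p.threshold = get_double(cf, threshold_key, p.threshold);
    p.exponent = get_double(cf, exponent_key, p.exponent);
    return p;
}
```

A section containing dUVsdp = enabled and dUVsdpThr = 0.5 but no dUVsdpExp yields an enabled method with threshold 0.5 and the default exponent 1.0, while a method whose key is missing stays disabled with all defaults intact.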

Page 49: The $1,000,000 Netflix Contest

PredictionConfig.C page 3

/* Public method used to obtain the parameters associated with an
 * internal pruning type.
 * \param pr Enumerated type describing the internal pruning method
 *           for which parameter information is to be obtained.
 * \return Pointer to the structure describing how the internal
 *         pruning method is to be implemented. */

struct pruning *PredictionConfig::get_internal_prune (enum internal_pruning pr){ switch ( pr ) { case user_dvCorp: return &dvCorp; case user_dvCors: return &dvCors; case user_vdCorp: return &vdCorp; case user_vdCors: return &vdCors; case user_pCor: return &pCor; case user_dCor: return &dCor; case user_sCor: return &sCor;

case movie_DVCorp: return &DVCorp; case movie_DVCors: return &DVCors; case movie_VDCorp: return &VDCorp; case movie_VDCors: return &VDCors; case movie_PCor: return &PCor; case movie_DCor: return &DCor; case movie_SCor: return &SCor;

/* Standard deviation types */ case user_dUVsdp: return &dUVsdp; case user_dUVsds: return &dUVsds; case user_Vsdp_Usdp: return &Vsdp_Usdp; case user_Vsds_Usds: return &Vsds_Usds;

case movie_dMNsdp: return &dMNsdp; case movie_dMNsds: return &dMNsds; case movie_Nsdp_Msdp: return &Nsdp_Msdp; case movie_Nsds_Msds: return &Nsds_Msds; } return NULL;}

/* Public method. parses configuration file and translates the ASCII * key/value pairs into appropriate configuration variables. * \param file A character pointer to the file name containing * the configuration to be read. * \return A boolean value is returned to indicate whether * or not the read of the configuration file was * successful. A true value indicates success while * failure is indicated by a false value. */

bool PredictionConfig::read_config(const char * const file){ auto char *val; auto Config cf;

/* Open and parse the configuration file. */ cf = Config_Init(); if(cf == NULL ) return false; if(Config_Parse(cf,file)<0){Config_Destroy(cf); return false;}

/* Set general prediction parameters. */ if (!Config_Set_Section(cf,"Default")) {Config_Destroy(cf);return false;} val = Config_Get(cf, "name"); if ( val != NULL ) name = strdup(val); user_voting = _is_enabled(cf, "user_voting"); movie_voting = _is_enabled(cf, "movie_voting"); val = Config_Get(cf, "user_vote_weight"); if ( val != NULL ) user_vote_weight = atof(val);

/* Process user voting parameters. */ if ( user_voting && Config_Set_Section(cf, "user_voting") ) { user_force_vote_in_Voter_Loop = _is_enabled(cf, \ "force_vote_in_Voter_Loop"); user_force_vote_after_Voter_Loop = _is_enabled(cf, \ "force_vote_after_Voter_Loop"); user_reset_support = _is_enabled(cf, "reset_support"); user_boundary_override = _is_enabled(cf, "boundary_override"); if ( user_boundary_override ) { val = Config_Get(cf, "facz"); if ( val != NULL ) user_facz = atof(val); val = Config_Get(cf, "thrz"); if ( val != NULL ) user_thrz = atof(val); }

Page 50: The $1,000,000 Netflix Contest

PredictionConfig.C page 4 _set_internal_prune(cf,&dvCorp,"dvCorp","dvThrp","dvCorpWeight"); _set_internal_prune(cf,&dvCors,"dvCors","dvThrs","dvCorsWeight"); _set_internal_prune(cf,&vdCorp,"vdCorp","vdThrp","vdCorpWeight"); _set_internal_prune(cf,&vdCors,"vdCors","vdThrs","vdCorsWeight"); _set_internal_prune(cf,&pCor,"pCor","pThr","pCorWeight"); _set_internal_prune(cf,&dCor,"dCor","dThr","dCorWeight"); _set_internal_prune(cf,&sCor,"sCor","sThr","sCorWeight"); _set_stddev_prune(cf,&dUVsdp,"dUVsdp","dUVsdpThr","dUVsdpExp"); _set_stddev_prune(cf,&dUVsds,"dUVsds","dUVsdsThr","dUVsdsExp"); _set_stddev_prune(cf,&Vsdp_Usdp,"Vsdp_Usdp","Vsdp_UsdpThr", "Vsdp_UsdpExp"); _set_stddev_prune(cf,&Vsds_Usds,"Vsds_Usds","Vsds_UsdsThr", "Vsds_UsdsExp"); Prune_Movies_in_SupU.enabled=_is_enabled(cf, "Prune_Movies_in_SupU"); Prune_Users_in_SupM.enabled=_is_enabled(cf, "Prune_Users_in_SupM"); Prune_Movies_in_CoSupUV.enabled=_is_enabled(cf, "Prune_Movies_in_CoSupUV"); if(Prune_Movies_in_SupU.enabled)_set_external_prune(cf, &Prune_Movies_in_SupU,"user_voting Prune_Movies_in_SupU"); if(Prune_Users_in_SupM.enabled)_set_external_prune(cf, &Prune_Users_in_SupM,"user_voting Prune_Users_in_SupM"); if(Prune_Movies_in_CoSupUV.enabled)_set_external_prune(cf, &Prune_Movies_in_CoSupUV,"user_voting Prune_Movies_in_CoSupUV");} /* Process movie voting configuration. */ if ( movie_voting && Config_Set_Section(cf, "movie_voting")){ movie_force_vote_in_Voter_Loop=_is_enabled(cf, "force_vote_in_Voter_Loop"); movie_force_vote_outside_Voter_Loop=_is_enabled (cf, "force_vote_outside_Voter_Loop");

movie_reset_support = _is_enabled(cf,"reset_support"); movie_boundary_override = _is_enabled(cf,"boundary_override"); if(movie_boundary_override) { val=Config_Get(cf, "facz"); if(val!=NULL)movie_facz=atof(val); val = Config_Get(cf, "thrz"); if ( val != NULL ) movie_thrz = atof(val); } _set_internal_prune(cf,&DVCorp,"DVCorp","DVThrp","DVCorpWeight"); _set_internal_prune(cf,&DVCors,"DVCors","DVThrs","DVCorsWeight"); _set_internal_prune(cf,&VDCorp,"VDCorp","VDThrp","VDCorpWeight"); _set_internal_prune(cf,&VDCors,"VDCors","VDThrs","VDCorsWeight"); _set_internal_prune(cf,&PCor, "PCor", "PThr", "PCorWeight"); _set_internal_prune(cf,&DCor, "DCor", "DThr", "DCorWeight"); _set_internal_prune(cf,&SCor, "SCor", "SThr", "SCorWeight"); _set_stddev_prune(cf, &dMNsdp, "dMNsdp", "dMNsdpThr","dMNsdpExp"); _set_stddev_prune(cf, &dMNsds, "dMNsds", "dMNsdsThr","dMNsdsExp"); _set_stddev_prune(cf,&Nsdp_Msdp,"Nsdp_Msdp","Nsdp_MsdpThr", "Nsdp_MsdpExp"); _set_stddev_prune(cf,&Nsds_Msds,"Nsds_Msds","Nsds_MsdsThr", "Nsds_MsdsExp");

Movie_Prune_Users_in_SupM.enabled=_is_enabled(cf, "Prune_Users_in_SupM"); Movie_Prune_Movies_in_SupU.enabled=_is_enabled(cf, "Prune_Movies_in_SupU"); Movie_Prune_Users_in_CoSupMN.enabled=_is_enabled(cf, "Prune_Users_in_CoSupMN");

if ( Movie_Prune_Users_in_SupM.enabled ) _set_external_prune(cf,&Movie_Prune_Users_in_SupM, "movie_voting Prune_Users_in_SupM"); if ( Movie_Prune_Movies_in_SupU.enabled ) _set_external_prune(cf,&Movie_Prune_Movies_in_SupU, "movie_voting Prune_Movies_in_SupU"); if ( Movie_Prune_Users_in_CoSupMN.enabled ) _set_external_prune(cf, \ &Movie_Prune_Users_in_CoSupMN, \ "movie_voting Prune_Users_in_CoSupMN"); } Config_Destroy(cf); return true; }

fputs("\t\t\tPruning method: ", output); switch ( sp->method ) { case UserPrune: fputs("UserPrune\n", output); break; case UserFastPrune: fputs("UserFastPrune\n", output); break; case UserCommonCoSupportPrune: fputs("UserCommonCoSupportPrune\n", output);break; case MoviePrune: fputs("MoviePrune\n", output); break; case MovieFastPrune: fputs("MovieFastPrune\n", output); break; case MovieCommonCoSupportPrune:fputs("MovieCommonCoSupportPrune\n", output);break; }

fprintf(output,"\t\t\t\tmstrt: %-llu\tmultiplier: %-7.2f\n", pp->mstrt,pp->mstrt_mult); fprintf(output,"\t\t\t\tustrt: %-llu\tmultiplier: %-7.2f\n", pp->ustrt,pp->ustrt_mult); fprintf(output,"\t\t\t\tTSa: %-7.2f\tTSb: %-7.2f\n", pp->TSa, pp->TSb); fprintf(output,"\t\t\t\tTdvp: %-7.2f\tTdvs: %-7.2f\n", pp->Tdvp,pp->Tdvs); fprintf(output,"\t\t\t\tTvdp: %-7.2f\tTvds: %-7.2f\n", pp->Tvdp,pp->Tvds); fprintf(output,"\t\t\t\tTD: %-7.2f\tTP: %-7.2f\n", pp->TD, pp->TP); fprintf(output,"\t\t\t\tPPm: %-7.2f\n", pp->PPm); fprintf(output,"\t\t\t\tTV: %-7.2f\tTSD: %-7.2f\n", pp->TV, pp->TSD); fprintf(output,"\t\t\t\tCh: %-7.2f\tCt: %-7.2f\n", pp->Ch, pp->Ct); return; }

Page 51: The $1,000,000 Netflix Contest

PredictionConfig.C page 5

/* Public method prints interpretation of configuration. * \param output file descriptor to be used for output.*/ void PredictionConfig::print(FILE *output) { if ( name == NULL ) fputc('\n', output); else fprintf(output, "\tName: %s\n\n", name); if ( user_voting ) { fputs("\tUser voting enabled.\n", output); fprintf(output, "\t\tUser vote weight: %-7.2f\n", \ user_vote_weight); fputs("\t\tForce vote in voter loop will be ", output); if(user_force_vote_in_Voter_Loop ) fputs("enabled.\n",output); else fputs("disabled.\n", output); fputs("\t\tForce vote after voter loop will be ", output); if(user_force_vote_after_Voter_Loop) fputs("enabled.\n",output); else fputs("disabled.\n", output); fputs("\t\tUser support ", output); if ( user_reset_support ) fputs("will be reset.\n", output); else fputs("will not be reset.\n", output); fputs("\t\tBoundary override ", output); if ( user_boundary_override ) { fputs("is enabled:\n", output);

fprintf(output, "\t\t\tfacz: %-7.2f\tthrz: %-7.2f\n", \ user_facz, user_thrz); } else fputs("not enabled.\n", output); _print_internal_prune(&dvCorp, "dvCorp", output); _print_internal_prune(&dvCors, "dvCors", output); _print_internal_prune(&vdCorp, "vdCorp", output); _print_internal_prune(&vdCors, "vdCors", output); _print_internal_prune(&pCor, "pCor", output); _print_internal_prune(&dCor, "dCor", output); _print_internal_prune(&sCor, "sCor", output); _print_stddev_prune(&dUVsdp, "dUVsdp", output); _print_stddev_prune(&dUVsds, "dUVsds", output); _print_stddev_prune(&Vsdp_Usdp,"Vsdp_Usdp",output); _print_stddev_prune(&Vsds_Usds,"Vsds_Usds", output); _print_external_prune(&Prune_Movies_in_SupU, \ "Prune_Movies_in_SupU", output); _print_external_prune(&Prune_Users_in_SupM, \ "Prune_Users_in_SupM", output); _print_external_prune(&Prune_Movies_in_CoSupUV, \ "Prune_Movies_in_CoSupUV", output); fputc('\n', output); } if ( movie_voting ) { fputs("\tMovie voting enabled.\n", output); fprintf(output, "\t\tMovie vote weight: %-7.2f\n", \ 1.0 - user_vote_weight); fputs("\t\tForce vote in voter loop will be ", output); if ( movie_force_vote_in_Voter_Loop ) fputs("enabled.\n", output); else fputs("disabled.\n", output); fputs("\t\tForce vote outside voter loop will be ",output); if ( movie_force_vote_outside_Voter_Loop ) fputs("enabled.\n", output); else fputs("disabled.\n", output); fputs("\t\tMovie support ", output);

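    if ( movie_reset_support ) fputs("will be reset.\n", output);
    else fputs("will not be reset.\n", output);
    fputs("\t\tBoundary override ", output);
    if ( movie_boundary_override ) {
        fputs("is enabled:\n", output);
        fprintf(output, "\t\t\tfacz: %-7.2f\tthrz: %-7.2f\n",
                movie_facz, movie_thrz);
    }
    else fputs("not enabled.\n", output);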

    _print_internal_prune(&DVCorp, "DVCorp", output);
    _print_internal_prune(&DVCors, "DVCors", output);
    _print_internal_prune(&VDCorp, "VDCorp", output);
    _print_internal_prune(&VDCors, "VDCors", output);
    _print_internal_prune(&PCor, "PCor", output);
    _print_internal_prune(&DCor, "DCor", output);
    _print_internal_prune(&SCor, "SCor", output);

_print_stddev_prune(&dMNsdp, "dMNsdp", output); _print_stddev_prune(&dMNsds, "dMNsds", output); _print_stddev_prune(&Nsdp_Msdp,"Nsdp_Msdp",output); _print_stddev_prune(&Nsds_Msds,"Nsds_Msds",output);

_print_external_prune(&Movie_Prune_Users_in_SupM, "Prune_Users_in_SupM", output); _print_external_prune(&Movie_Prune_Movies_in_SupU, "Prune_Movies_in_SupU", output); _print_external_prune(&Movie_Prune_Users_in_CoSupMN, "Prune_Users_in_CoSupMN", output); fputc('\n', output); }

return; }
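PredictionConfig::print() reports the movie vote weight as 1.0 - user_vote_weight, which suggests that the user-based and movie-based votes are combined as a convex blend. The slides shown here do not include the code that performs the combination, so the sketch below is an assumption about how such a blend could look, not the actual mpp-user.C logic. The function name blend_votes is hypothetical.

```cpp
#include <cassert>

// Weighted combination of a user-based and a movie-based vote.
// The movie weight is 1 - user_vote_weight, matching what print()
// reports; that the two votes are combined linearly is assumed.
double blend_votes(double user_vote, double movie_vote,
                   double user_vote_weight)
{
    assert(user_vote_weight >= 0.0 && user_vote_weight <= 1.0);
    return user_vote_weight * user_vote
         + (1.0 - user_vote_weight) * movie_vote;
}
```

With the constructor default user_vote_weight = 1 the blend reduces to the user vote alone. With a weight of 0.75, a user vote of 4.0 and a movie vote of 2.0 combine to 0.75 * 4.0 + 0.25 * 2.0 = 3.5.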

Page 52: The $1,000,000 Netflix Contest

PredictionConfig.H

#if !defined(PREDICTIONCONFIG_H)
#define PREDICTIONCONFIG_H

/* Standard include files. */
#include <stdio.h>

/* Local include files. */

/* Enumeration types describing various
 * methods of internal pruning. */
enum internal_pruning {
    user_dvCorp, user_dvCors, user_vdCorp, user_vdCors,
    user_pCor, user_dCor, user_sCor,

    movie_DVCorp, movie_DVCors, movie_VDCorp, movie_VDCors,
    movie_PCor, movie_DCor, movie_SCor,

    /* Standard deviation types. */
    user_dUVsdp, user_dUVsds, user_Vsdp_Usdp, user_Vsds_Usds,
    movie_dMNsdp, movie_dMNsds, movie_Nsdp_Msdp, movie_Nsds_Msds
};

/* The following structure definition is used to generically describe
 * internal or standard deviation pruning methods. */
struct pruning {
    bool enabled;
    bool weight;
    double threshold;
    double exponent;
};

/* The following structure definition is used to encapsulate parameters
 * which configure the external pruning routines. */
struct pruning_parameters {
    unsigned long long int mstrt, ustrt;
    double mstrt_mult, ustrt_mult, TSa, TSb, Tdvp, Tdvs, Tvdp, Tvds,
           TD, TP, PPm, TV, TSD, Ch, Ct;
};

/* The following structure definition is used to encapsulate
 * information for the external pruning routines. */
enum prune_type {
    UserPrune, UserFastPrune, UserCommonCoSupportPrune,
    MoviePrune, MovieFastPrune, MovieCommonCoSupportPrune
};

struct external_prune {
    bool enabled;
    enum prune_type method;
    struct pruning_parameters params;
};

class PredictionConfig {
private:
    /* Name of the prediction configuration. */
    char *name;

    /* The first section of variables affects global prediction
     * parameters at the level of the mpp-user.C file. Subsequent
     * variable sections provide parameters specific to either
     * user or movie voting. */
    bool user_voting, movie_voting;
    double user_vote_weight;

    /* User voting parameters. */
    bool user_force_vote_in_Voter_Loop, user_force_vote_after_Voter_Loop;
    bool user_reset_support;
    bool user_boundary_override;
    double user_facz, user_thrz;

    /* Internal vote pruning. */
    struct pruning dvCorp, dvCors, vdCorp, vdCors, pCor, dCor, sCor;

    /* Standard deviation pruning. */
    struct pruning dUVsdp, dUVsds, Vsdp_Usdp, Vsds_Usds;

    struct external_prune Prune_Movies_in_SupU, Prune_Users_in_SupM,
                          Prune_Movies_in_CoSupUV;

    /* Movie voting parameters. */
    bool movie_force_vote_in_Voter_Loop, movie_force_vote_outside_Voter_Loop;
    bool movie_reset_support;
    bool movie_boundary_override;
    double movie_facz, movie_thrz;

    /* Internal vote pruning. */
    struct pruning DVCorp, DVCors, VDCorp, VDCors, PCor, DCor, SCor;

    /* Standard deviation pruning. */
    struct pruning dMNsdp, dMNsds, Nsdp_Msdp, Nsds_Msds;

    struct external_prune Movie_Prune_Users_in_SupM,
                          Movie_Prune_Movies_in_SupU,
                          Movie_Prune_Users_in_CoSupMN;

public:
    /* Void constructor. */
    PredictionConfig(void);

    /* Destructor. */
    ~PredictionConfig(void);

    /* Inline public methods to return type of voting. */
    inline bool do_user_voting(void) { return user_voting; }
    inline bool do_movie_voting(void) { return movie_voting; }

    /* Inline public method to access user vote weight. */
    inline double get_user_vote_weight(void) { return user_vote_weight; }

    /* Inline public methods to determine whether user and
     * movie support are reset after initial external pruning. */
    inline bool reset_user_support(void) { return user_reset_support; }
    inline bool reset_movie_support(void) { return movie_reset_support; }

    /* Inline public methods to return status of vote forcing. */
    inline bool user_vote_force_in_loop(void)
        { return user_force_vote_in_Voter_Loop; }
    inline bool user_vote_force_after_loop(void)
        { return user_force_vote_after_Voter_Loop; }
    inline bool movie_vote_force_in_loop(void)
        { return movie_force_vote_in_Voter_Loop; }
    inline bool movie_vote_force_after_loop(void)
        { return movie_force_vote_outside_Voter_Loop; }

    /* Inline accessor functions for returning external pruning
     * configuration. */
    inline struct external_prune *get_user_Prune_Movies_in_SupU(void)
        { return &Prune_Movies_in_SupU; }
    inline struct external_prune *get_user_Prune_Users_in_SupM(void)
        { return &Prune_Users_in_SupM; }
    inline struct external_prune *get_user_Prune_Movies_in_CoSupUV(void)
        { return &Prune_Movies_in_CoSupUV; }
    inline struct external_prune *get_movie_Prune_Users_in_SupM(void)
        { return &Movie_Prune_Users_in_SupM; }
    inline struct external_prune *get_movie_Prune_Movies_in_SupU(void)
        { return &Movie_Prune_Movies_in_SupU; }
    inline struct external_prune *get_movie_Prune_Users_in_CoSupMN(void)
        { return &Movie_Prune_Users_in_CoSupMN; }

    /* Public method to read a configuration file. */
    bool read_config(const char *);

    /* Public accessor for returning a pointer to the structure of an
     * internal prune. */
    struct pruning *get_internal_prune(enum internal_pruning);

    /* Public method to print out a configuration. */
    void print(FILE *);
};

#endif

Page 53: The $1,000,000 Netflix Contest

UserSet.C

/* System include files. */
#include <limits.h>
#include <stdlib.h>   /* malloc, free, strtoul */
#include <string.h>   /* strrchr */

/* Local include files. */
#include "UserSet.H"

/* Variables static to this module. */

/* No argument constructor. */
UserSet::UserSet(void) : ptree_set()
{
    id_numbers = NULL;
    return;
}

/* Destructor. */
UserSet::~UserSet(void)
{
    if ( id_numbers != NULL )
        free(id_numbers);
    return;
}

/**
 * Public method.
 * Calculates the rating a user provided for a movie.
 * \param user_index  The index number of the user.
 * \param movie_index The index number of the movie.
 * \return Rating number is returned to the caller
 *         (0 means the movie was not rated).
 */
double UserSet::get_rating(unsigned long int user_index,
                           unsigned long int movie_index)
{
    int rating = 0, val = 1;
    size_t slot = user_index * 3;

    /* Walk from the low order bit (tree 2) to the high order bit
     * (tree 0), doubling the bit value at each step. */
    for (int tree = 2; tree >= 0; --tree) {
        if ( ptree_set[slot + tree].is_set(movie_index) )
            rating += val;
        val <<= 1;
    }
    return rating;
}
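
The loop in get_rating rebuilds a rating from three bit columns, one per bit of the 3-bit rating value. A minimal sketch of that encoding follows, assuming an uncompressed std::vector<bool> as a stand-in for the real PTree type; the names BitColumn, UserSlices, encode_ratings and decode_rating are hypothetical and are not part of the original source.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for one user's three rating PTrees: slices[0] is the high
// order bit column and slices[2] the low order one. The real PTree is
// replaced here by a plain bit vector (an assumption).
using BitColumn = std::vector<bool>;
using UserSlices = std::array<BitColumn, 3>;

// Hypothetical helper: split ratings (0-5, 0 = not rated) into columns.
UserSlices encode_ratings(const std::vector<int>& ratings) {
    UserSlices slices;
    for (auto& column : slices) column.assign(ratings.size(), false);
    for (std::size_t movie = 0; movie < ratings.size(); ++movie) {
        assert(ratings[movie] >= 0 && ratings[movie] <= 5);
        slices[0][movie] = (ratings[movie] & 4) != 0;
        slices[1][movie] = (ratings[movie] & 2) != 0;
        slices[2][movie] = (ratings[movie] & 1) != 0;
    }
    return slices;
}

// Mirrors the loop in get_rating: visit the low order column first and
// double the bit value after each column.
int decode_rating(const UserSlices& slices, std::size_t movie) {
    int rating = 0, val = 1;
    for (int tree = 2; tree >= 0; --tree) {
        if (slices[tree][movie]) rating += val;
        val <<= 1;
    }
    return rating;
}
```

A rating of 0 leaves all three bits clear, which is how "did not rate" is represented in this layout.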

/* Public method.
 * Calculates the mean rating of a group of movies by a user.
 * The group of movies is specified by the bit mask in the
 * PTree supplied as an argument to this method.
 * \param user_index Index number of the user for whom
 *                   the mean is being calculated.
 * \param cosupport  Mask specifying the group of movies for
 *                   which the mean is to be calculated.
 * \return Rating number returned to caller. */
double UserSet::get_mean(unsigned long int user_index, PTree &cosupport)
{
    size_t slot = user_index * 3;
    double mean = 0;
    PTree bitcolumn;

    /* Iterate over the three bit positions which represent
     * movie ratings. Multiply the number of 1 bits in each
     * bit column by the bit value of the tree. */
    for (int tree = 2, bit = 0; tree >= 0; --tree, ++bit) {
        bitcolumn = cosupport & ptree_set[slot + tree];
        mean += bitcolumn.get_count() * pow(2.0, bit);
    }

    /* Divide by the number of movies in the cosupport list to
     * complete the mean. */
    return mean / cosupport.get_count();
}

/* Public method returns a PTree describing the set
 * of movies which a user has rated. */
PTree UserSet::get_movies(unsigned long int index)
{
    size_t slot = index * 3;

    return ptree_set[slot] | ptree_set[slot + 1] | ptree_set[slot + 2];
}
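
The arithmetic behind get_mean and get_movies can be shown on plain bit vectors: the masked sum of ratings equals 4, 2 and 1 times the number of selected 1 bits in the high, middle and low columns, and the set of rated movies is the OR of the three columns. The sketch below assumes std::vector<bool> in place of PTree; and_count, masked_mean and rated_mask are hypothetical names, and the empty-mask guard is an addition that the original method does not have.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using BitColumn = std::vector<bool>;  // stand-in for a PTree (assumption)

// Equivalent of (mask & column).get_count(): positions set in both.
std::size_t and_count(const BitColumn& mask, const BitColumn& column) {
    assert(mask.size() == column.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < mask.size(); ++i)
        if (mask[i] && column[i]) ++count;
    return count;
}

// Mean rating over the masked movies. high, mid and low are the bit
// columns with values 4, 2 and 1. Returns 0.0 for an empty mask; that
// guard is not in the original get_mean.
double masked_mean(const BitColumn& cosupport, const BitColumn& high,
                   const BitColumn& mid, const BitColumn& low) {
    const std::size_t selected = and_count(cosupport, cosupport);
    if (selected == 0) return 0.0;
    const double sum = 4.0 * and_count(cosupport, high) +
                       2.0 * and_count(cosupport, mid) +
                       1.0 * and_count(cosupport, low);
    return sum / static_cast<double>(selected);
}

// Movies with any rating bit set, as in get_movies.
BitColumn rated_mask(const BitColumn& high, const BitColumn& mid,
                     const BitColumn& low) {
    BitColumn rated(high.size(), false);
    for (std::size_t i = 0; i < high.size(); ++i)
        rated[i] = high[i] || mid[i] || low[i];
    return rated;
}
```

Because the sum is divided by the number of set bits in the mask, the mask should select only movies the user actually rated; an unrated movie in the mask contributes 0 to the sum and lowers the mean.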

/* Public method.
 * This method converts a user identity number into the index value
 * used to reference the PTree's corresponding to this user.
 * Note that 0 is also returned when the identity is not found. */
unsigned long int UserSet::get_index(unsigned long int id_number)
{
    unsigned long int identities = ptree_set.size() / 3;

    for (unsigned long int lp = 0; lp < identities; ++lp)
        if ( id_numbers[lp] == id_number )
            return lp;
    return 0;
}

/* Public method used to obtain pointers to the rating PTree's of a given
 * user. Each user has 3 associated PTree's, each corresponding to one of
 * the three bits used to represent movie ratings. The zeroth PTree
 * represents the high order bit of the rating value.
 * \param user_index Index number of user whose rating is to be returned.
 * \param bit        Bit position PTree to be returned.
 * \return NULL is returned if an invalid PTree is requested.
 *         Else requested PTree is returned to the caller. */
PTree UserSet::get_ptree(unsigned long int user_index, int bit)
{
    size_t slot = user_index * 3;

    if ( slot >= ptree_set.size() )
        return NULL;
    return ptree_set[slot + bit];
}


Page 54: The $1,000,000 Netflix Contest

UserSet.C page 2

/* Public method.
 * This method sets up an array containing the numerical value of
 * user identity numbers corresponding to the PTree slots.
 * \return Boolean return is used to indicate success
 *         or failure of the load. */
bool UserSet::load_identities(void)
{
    size_t cnt = 0, number_of_identities;
    char *p, bufr[PATH_MAX];
    FILE *input;

    /* PTreeSet must be loaded. */
    if ( ptree_set.size() == 0 )
        return false;

    /* Allocate an array of integers to hold identities. */
    number_of_identities = ptree_set.size() / 3;
    id_numbers = (unsigned long int *)
        malloc(number_of_identities * sizeof(unsigned long int));
    if ( id_numbers == NULL )
        return false;

    /* Read file and convert identities to integers. */
    snprintf(bufr, sizeof(bufr),
             "%s/mpred-data/nf_mv_us_pt/user-attributes.txt", PTREEDATA);
    input = fopen(bufr, "r");
    if ( input == NULL )
        return false;

    /* Stop at end of file or once every identity has been read. */
    while ( cnt < number_of_identities &&
            fgets(bufr, sizeof(bufr), input) != NULL ) {
        if ( (p = strrchr(bufr, '\n')) != NULL )
            *p = '\0';
        id_numbers[cnt++] = strtoul(bufr, NULL, 10);
    }

    /* Close the file on every path. */
    fclose(input);

    /* Succeed only if every identity was read. */
    return cnt == number_of_identities;
}

/* Public method.
 * This method dumps all component PTree's of the set of movies.
 * \param output Output descriptor where the PTree's are
 *               to be directed. */
void UserSet::dump(FILE *output)
{
    for (size_t lp = 0; lp < ptree_set.size(); ++lp)
        ptree_set[lp].dump(output);
    return;
}

/* Public method to print index/attribute pairings. */
void UserSet::print(FILE *output)
{
    unsigned long int identities = ptree_set.size() / 3;

    for (unsigned long int lp = 0; lp < identities; ++lp)
        fprintf(output, "%lu -> %lu\n", lp, id_numbers[lp]);
    return;
}

UserSet.H

#if !defined(USERSET_H)
#define USERSET_H

/* Standard include files. */
#include <stdio.h>
#include <math.h>

/* Local include files. */
#include "PTreeSet.H"

class UserSet {
private:
    unsigned long int *id_numbers;
    PTreeSet ptree_set;

public:
    /* Void constructor. */
    UserSet(void);

    /* Constructor to initialize an in-memory tree. */

    /* Destructor. */
    ~UserSet(void);

    /* Public inline method to return the identity of a user index. */
    unsigned long int get_identity(unsigned long int index)
        { return id_numbers[index]; }

    /* Public method to return the index of a user identity. */
    unsigned long int get_index(unsigned long int);

    /* Public method to return the rating of a movie by a user. */
    double get_rating(unsigned long int, unsigned long int);

    /* Public method to return the mean rating of a set of movies. */
    double get_mean(unsigned long int, PTree &);

    /* Public method to return the set of movies rated by a user. */
    PTree get_movies(unsigned long int);

    /* Public method to return rating PTree's. */
    PTree get_ptree(unsigned long int, int);

    /* Public method to load a list of user identities. */
    bool load_identities(void);

    /* Public method to print sparseness of set. */
    void dump(FILE *);

    /* Public method to load a set of PTree's saved in ASCII format. */
    bool load(FILE *);

    /* Public method to load a binary PTree set. */
    bool load_binary(void);

    /* Public method to print index/attribute pairings. */
    void print(FILE *);
};

#endif

Page 55: The $1,000,000 Netflix Contest

Sample config file

# Sample prediction configuration file.

# Name of the configuration.
name = default

# Allow user voting: enabled or disabled
user_voting = disabled

# Do movie based voting: enabled or disabled
movie_voting = enabled

# User vote weighting. Movie vote weighting will be derived from
# the value of this variable.
user_vote_weight = 0

# User voting configuration.
# This section is only processed if user voting is enabled.
[user_voting]

# The following options specify where and if votes are forced into
# their standard range of 1-5.
force_vote_in_Voter_Loop = disabled
force_vote_after_Voter_Loop = disabled

# The following variable controls whether or not user support is reset
# after user pruning is completed.
reset_support = disabled

# The following variables control Boundary Based prediction overrides.
# The parameters are only evaluated if the boundary based method is
# enabled.
# boundary_override = disabled;
# facz = 0
# thrz = 0

# Internal pruning configuration.
# One or more of the pruning functions can be enabled.
# For each pruning type a default threshold can be set. If not set the
# default value indicated below is used.
# The third variable selects a vote weighting option. If the weight variant
# of the pruning method is enabled the value of uCor is set to that value.
# Note that the last enabled weight will set uCor.

# dvCorp = disabled
# dvThrp = 0
# dvCorpWeight = disabled

# dvCors = disabled
# dvThrs = 0
# dvCorsWeight = disabled

# vdCorp = disabled
# vdThrp = 0
# vdCorpWeight = disabled

# vdCors = disabled
# vdThrs = 0
# vdCorsWeight = disabled

# pCor = disabled
# pThr = 0
# pCorWeight = disabled

# dCor = disabled
# dThr = 0
# dCorWeight = disabled

# sCor = disabled
# sThr = 0
# sCorWeight = disabled

# Standard deviation pruning.
# One or more of the following methods can be selected. The default is
# for all these methods to be disabled.
# Each pruning method has a threshold and exponent value associated with
# it. The default values are noted in the definitions below.

# dUVsdp = disabled
# dUVsdpThr = 0
# dUVsdpExp = -1

# dUVsds = disabled
# dUVsdsThr = 0
# dUVsdsExp = -1

# Vsdp_Usdp = disabled
# Vsdp_UsdpThr = 0
# Vsdp_UsdpExp = -1

# Vsds_Usds = disabled
# Vsds_UsdsThr = 0
# Vsds_UsdsExp = -1
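
The sample file uses three kinds of lines: comments starting with #, key = value assignments, and bracketed section headers whose names may contain a space (for example [user_voting Prune_Movies_in_SupU]). A minimal sketch of a reader for that layout is given here for orientation only. It is not the read_config method of PredictionConfig, whose implementation is not shown; the names Config, trim and parse_config are hypothetical, and storing every value as a string is a simplifying assumption.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// section name -> (key -> value); "" holds keys before any section.
using Config = std::map<std::string, std::map<std::string, std::string>>;

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical parser for the "# comment", "[section]", "key = value"
// layout used by the sample configuration file.
Config parse_config(const std::string& text) {
    Config config;
    std::string section;  // current section, "" for the global part
    std::istringstream input(text);
    std::string line;
    while (std::getline(input, line)) {
        line = trim(line);
        if (line.empty() || line[0] == '#') continue;   // blank or comment
        if (line.front() == '[' && line.back() == ']') {
            section = trim(line.substr(1, line.size() - 2));
            continue;
        }
        const auto eq = line.find('=');
        if (eq == std::string::npos) continue;          // skip malformed lines
        config[section][trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return config;
}
```

Commented-out settings such as "# dvCorp = disabled" are skipped by this sketch, so the documented defaults would have to be supplied by the caller.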

Page 56: The $1,000,000 Netflix Contest

Sample config file - pg 2

# External pruning configuration
# The following section selects the use of any combination of three
# pruning functions. By default pruning is disabled.
# Each pruning method is encapsulated in its own section. This allows
# a pruning configuration to be turned on and off without disturbing
# the pruning configuration.
# Within each pruning section there are six different methods for
# implementing the pruning. These methods are:
# UserPrune, UserFastPrune, UserCommonCoSupportPrune
# MoviePrune, MovieFastPrune, MovieCommonCoSupportPrune
# There are a total of 15 parameters which select the configuration
# of the pruning. Default values are noted.

Prune_Movies_in_SupU = disabled
Prune_Users_in_SupM = disabled
Prune_Movies_in_CoSupUV = disabled

[user_voting Prune_Movies_in_SupU]
method = UserPrune
leftside = 0
width = 0
mstrt = 0
mstrt_mult = 0.0
ustrt = 0
ustrt_mult = 0.0
TSa = -100
TSb = -100
Tdvp = -1
Tdvs = -1
Tvdp = -1
Tvds = -1
TD = -1
TP = -1
PPm = .1
TV = -1
TSD = -1
Ch = 1
Ct = 2

[user_voting Prune_Users_in_SupM]
method = UserPrune
leftside = 0
width = 0
mstrt = 0
mstrt_mult = 0.0
ustrt = 0
ustrt_mult = 0.0
TSa = -100
TSb = -100
Tdvp = -1
Tdvs = -1
Tvdp = -1
Tvds = -1
TD = -1
TP = -1
PPm = .1
TV = -1
TSD = -1
Ch = 1
Ct = 2

[user_voting Prune_Movies_in_CoSupUV]
method = UserPrune
leftside = 0
width = 0
mstrt = 0
mstrt_mult = 0.0
ustrt = 0
ustrt_mult = 0.0
TSa = -100
TSb = -100
Tdvp = -1
Tdvs = -1
Tvdp = -1
Tvds = -1
TD = -1
TP = -1
PPm = .1
TV = -1
TSD = -1
Ch = 1
Ct = 2

# Movie voting configuration.
# This section is only processed if the movie_voting variable is set to
# enabled in the default section.
[movie_voting]

# The following options specify where and if votes are forced into
# their standard range of 1-5.
force_vote_in_Voter_Loop = disabled
force_vote_outside_Voter_Loop = disabled

# The following variable controls whether or not user support is reset
# after user pruning is completed.
reset_support = disabled

# The following variables control Boundary Based prediction overrides.
# The parameters are only evaluated if the boundary based method is
# enabled.
# boundary_override = disabled;
# facz = 0
# thrz = 0

# Internal pruning configuration.
# One or more of the pruning functions can be enabled.
# For each pruning type a default threshold can be set. If not set the
# default value indicated below is used.
# The third variable selects a vote weighting option. If the weight variant
# of the pruning method is enabled the value of uCor is set to that value.
# Note that the last enabled weight will set uCor.

# DVCorp = disabled
# DVThrp = 0
# DVCorpWeight = disabled

# DVCors = enabled
# DVThrs = 0
# DVCorsWeight = disabled

# VDCorp = enabled
# VDThrp = 0
# VDCorpWeight = disabled

# VDCors = enabled
# VDThrs = 0
# VDCorsWeight = disabled

# PCor = enabled
# PThr = 0
# PCorWeight = disabled

# DCor = enabled
# DThr = 0
# DCorWeight = disabled

# SCor = enabled
# SThr = 0
# SCorWeight = disabled

Page 57: The $1,000,000 Netflix Contest

Sample config file - pg 3

# Standard deviation pruning.
# One or more of the following methods can be selected. The default is
# for all these methods to be disabled.
# Each pruning method has a threshold and exponent value associated with
# it. The default values are noted in the definitions below.

# dMNsdp = disabled
# dMNsdpThr = 0
# dMNsdpExp = -1

# dMNsds = disabled
# dMNsdsThr = 0
# dMNsdsExp = -1

# Nsdp_Msdp = enabled
# Nsdp_MsdpThr = 0
# Nsdp_MsdpExp = -1

# Nsds_Msds = enabled
# Nsds_MsdsThr = 0
# Nsds_MsdsExp = -1

# External pruning configuration
# The following section selects the use of any combination of three
# pruning functions. By default pruning is disabled.
# Each pruning method is encapsulated in its own section. This allows
# a pruning configuration to be turned on and off without disturbing
# the pruning configuration.
# Within each pruning section there are six different methods for
# implementing the pruning. These methods are:
# UserPrune, UserFastPrune, UserCommonCoSupportPrune
# MoviePrune, MovieFastPrune, MovieCommonCoSupportPrune
# There are a total of 15 parameters which select the configuration
# of the pruning. Default values are noted.

Prune_Users_in_SupM = disabled
Prune_Movies_in_SupU = enabled
Prune_Users_in_CoSupMN = enabled

[movie_voting Prune_Users_in_SupM]
method = UserPrune
leftside = 0
width = 0
mstrt = 0
mstrt_mult = 0.0
ustrt = 0
ustrt_mult = 0.0
TSa = -100
TSb = -100
Tdvp = -1
Tdvs = -1
Tvdp = -1
Tvds = -1
TD = -1
TP = -1
PPm = .1
TV = -1
TSD = -1
Ch = 1
Ct = 2

[movie_voting Prune_Movies_in_SupU]
method = MovieCommonCoSupportPrune
leftside = +40
width = 10
mstrt = 0
mstrt_mult = 0.0
ustrt = 0
ustrt_mult = 0.0
TSa = -100
TSb = -100
Tdvp = -1
Tdvs = -1
Tvdp = -1
Tvds = -1
TD = -1
TP = -1
PPm = .1
TV = -1
TSD = -1
Ch = 1
Ct = 2

[movie_voting Prune_Users_in_CoSupMN]
method = UserCommonCoSupportPrune
leftside = 0
width = 8000
mstrt = 0
mstrt_mult = 0.0
ustrt = 0
ustrt_mult = 0.0
TSa = -100
TSb = -100
Tdvp = -1
Tdvs = -1
Tvdp = -1
Tvds = -1
TD = -1
TP = -1
PPm = .1
TV = -1
TSD = -1
Ch = 1