collaborative filtering for recommendation systems in python, nicolas hug
TRANSCRIPT
TAXONOMYContent-basedLeverageinformationabouttheitemscontentBasedonitemprofiles(metadata).
CollaborativefilteringLeveragesocialinformation(user-iteminteractions)E.g.:recommenditemslikedbymypeers.Usuallyyieldsbetterresults.
Knowledge-basedLeverageusersrequirementsUsedforcars,loans,realestate...Verytask-specific.
5
CollaborativefilteringLeveragesocialinformation(user-iteminteractions)E.g.:recommenditemslikedbymypeers.Usuallyyieldsbetterresults.
5
OUTLINE1. THENEIGHBORHOODMETHOD(K-NN)2. OVERVIEWOF
( )
SURPRISEhttp://surpriselib.com
8
OUTLINE1. THENEIGHBORHOODMETHOD(K-NN)2. OVERVIEWOF
( )
SURPRISEhttp://surpriselib.com
3. MATRIXFACTORIZATION
8
NEIGHBORHOODMETHODWehaveahistoryofpastratingsWeneedtopredictAlice'sratingforTitanic
1. FindtheusersthathavethesametastesasAlice(usingtheratinghistory)
2. AveragetheirratingsforTitanic
That'sit10
WENEEDASIMILARITYMEASURE
numberofcommonrateditemsaverageabsolutedifferencebetweenratings(it'sactuallya
distance)
11
WENEEDASIMILARITYMEASURE
numberofcommonrateditemsaverageabsolutedifferencebetweenratings(it'sactuallya
distance)
cosineanglebetween and
11
WENEEDASIMILARITYMEASURE
numberofcommonrateditemsaverageabsolutedifferencebetweenratings(it'sactuallya
distance)
cosineanglebetween andPearsoncorrelationcoefficientbetween and
11
WENEEDASIMILARITYMEASURE
numberofcommonrateditemsaverageabsolutedifferencebetweenratings(it'sactuallya
distance)
cosineanglebetween andPearsoncorrelationcoefficientbetween and
About3millionsothermeasures11
NEIGHBORHOODMETHODWehaveahistoryofpastratingsWeneedtopredictAlice'sratingforTitanic
1. FindtheusersthathavethesametastesasAlice(usingtheratinghistory)
2. AveragetheirratingsforTitanic
That'sit12
AGGREGATINGTHENEIGHBORS'RATINGS
AliceisclosetoBob.HerratingforTitanicisprobablyclosetoBob's(1).So
13
AGGREGATINGTHENEIGHBORS'RATINGS
AliceisclosetoBob.HerratingforTitanicisprobablyclosetoBob's(1).SoButweshouldofcourselookatmorethanoneneighbor!
13
LET'SCODE!defpredict_rating(u,i):'''Returnestimatedratingofuseruforitemi.rating_histisalistoftuples(user,item,rating)'''
#Retrieveusershavingratedineighbors=[(sim[u,v],r_vj)for(v,j,r_vj)inrating_histif(i==j)]#Sortthembysimilaritywithuneighbors.sort(key=lambdatple:tple[0],reversed=True)#Computeweightedaverageofthek-NN'sratingsnum=sum(sim_uv*r_vifor(sim_uv,r_vi)inneighbors[:k])denum=sum(sim_uvfor(sim_uv,_)inneighbors[:k])
returnnum/denum
14
LET'SCODE!defpredict_rating(u,i):'''Returnestimatedratingofuseruforitemi.rating_histisalistoftuples(user,item,rating)'''
#Retrieveusershavingratedineighbors=[(sim[u,v],r_vj)for(v,j,r_vj)inrating_histif(i==j)]#Sortthembysimilaritywithuneighbors.sort(key=lambdatple:tple[0],reversed=True)#Computeweightedaverageofthek-NN'sratingsnum=sum(sim_uv*r_vifor(sim_uv,r_vi)inneighbors[:k])denum=sum(sim_uvfor(sim_uv,_)inneighbors[:k])
returnnum/denum
14
NEIGHBORHOODMETHODWehaveahistoryofpastratingsWeneedtopredictAlice'sratingforTitanic
1. FindtheusersthathavethesametastesasAlice(usingtheratinghistory)
2. AveragetheirratingsforTitanic
That'sit... 15
NEIGHBORHOODMETHODWehaveahistoryofpastratingsWeneedtopredictAlice'sratingforTitanic
1. FindtheusersthathavethesametastesasAlice(usingtheratinghistory)
2. AveragetheirratingsforTitanic
That'sit...?15
THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS
NormalizetheratingsRemovebias(someusersaremean)Useafancieraggregation
16
THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS
NormalizetheratingsRemovebias(someusersaremean)UseafancieraggregationDiscountsimilarities(givethemmoreorlessconfidence)
16
THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS
NormalizetheratingsRemovebias(someusersaremean)UseafancieraggregationDiscountsimilarities(givethemmoreorlessconfidence)Useitem-itemsimilarityinstead
16
THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS
NormalizetheratingsRemovebias(someusersaremean)UseafancieraggregationDiscountsimilarities(givethemmoreorlessconfidence)Useitem-itemsimilarityinsteadOrusebothkindsofsimilarities!
16
THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS
NormalizetheratingsRemovebias(someusersaremean)UseafancieraggregationDiscountsimilarities(givethemmoreorlessconfidence)Useitem-itemsimilarityinsteadOrusebothkindsofsimilarities!Clusterusersand/oritems
16
THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS
NormalizetheratingsRemovebias(someusersaremean)UseafancieraggregationDiscountsimilarities(givethemmoreorlessconfidence)Useitem-itemsimilarityinsteadOrusebothkindsofsimilarities!Clusterusersand/oritemsLearnthesimilarities
16
THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS
NormalizetheratingsRemovebias(someusersaremean)UseafancieraggregationDiscountsimilarities(givethemmoreorlessconfidence)Useitem-itemsimilarityinsteadOrusebothkindsofsimilarities!Clusterusersand/oritemsLearnthesimilaritiesBlahblahblah...
16
OUTLINE1. THENEIGHBORHOODMETHOD(K-NN)2. OVERVIEWOF
( )3. MATRIXFACTORIZATION
SURPRISEhttp://surpriselib.com
17
SURPRISEAPythonlibraryforrecommendersystems
(Orrather:aPythonlibraryforratingpredictionalgorithms)
18
YES:)
NO:(
clf=MyClassifier()clf.fit(X_train,y_train)clf.predict(X_test)
rec=MyRecommender()rec.fit(X_train,y_train)rec.predict(X_test)
21
BASICUSAGEfromsurpriseimportKNNBasicfromsurpriseimportDataset
data=Dataset.load_builtin('ml-100k')#downloaddatasettrainset=data.build_full_trainset()#buildtrainset
algo=KNNBasic()#usebuilt-inpredictionalgorithm
algo.train(trainset)#fitdata
algo.predict('Alice','Titanic')algo.predict('Bob','ToyStory')#...
22
DATASETHANDLING#Built-indatasets(movielens,Jester)data=Dataset.load_builtin('ml-100k')
#Customdatasets(fromfile)file_path=os.path.expanduser('./my_data')reader=Reader(line_format='useritemratingtimestamp',sep='\t')data=Dataset.load_from_file(file_path,reader=reader)
#Customdatasets(frompandasdataframe)reader=Reader(rating_scale=(1,5))data=Dataset.load_from_df(df[['uid','iid','r_ui']],reader)
24
MAINFEATURES
Built-inpredictionalgorithms(SVD,k-NN,manyothers)Built-insimilaritymeasures(Cosine,Pearson...)
25
BUILT-INALGORITHMSANDSIMILARITIES
SVD,SVD++,NMF, -NN(withvariants),SlopeOne,somebaselines...Cosine,Pearson,MSD(betweenusersoritems)sim_options={'name':'cosine','user_based':False,'min_support':10}algo=KNNBasic(sim_options=sim_options)
26
CUSTOMPREDICTIONALGORITHMclassMyRatingPredictor(AlgoBase):
defestimate(self,u,i):return3
algo=MyRatingPredictor()algo.train(trainset)pred=algo.predict('Alice','Titanic')#willcallestimate
28
CUSTOMPREDICTIONALGORITHMclassMyRatingPredictor(AlgoBase):
deftrain(self,trainset):'''Fityourdatahere.'''
self.mean=np.mean([r_uifor(u,i,r_ui)intrainset.all_ratings()])
defestimate(self,u,i):returnself.mean
29
CUSTOMPREDICTIONALGORITHMclassMyRatingPredictor(AlgoBase):
deftrain(self,trainset):'''Fityourdatahere.'''
self.mean=np.mean([r_uifor(u,i,r_ui)intrainset.all_ratings()])
defestimate(self,u,i):returnself.mean
trainsetobject:Iteratorsovertheratinghistory(user-wise,item-wise,orunordered)IteratorsoverusersanditemsidsOtherusefulmethods/attributes
29
classMyKNN(AlgoBase):
deftrain(self,trainset):self.trainset=trainsetself.sim=...#Computesimilaritymatrix
defestimate(self,u,i):
neighbors=[(self.sim[u,v],r_vi)for(v,r_vj)inself.trainset.ur[u]]neighbors=sorted(neighbors,key=lambdatple:tple[0],reverse=True)
sum_sim=sum_ratings=0for(sim_uv,r_vi)inneighbors[:self.k]:sum_sim+=simsum_ratings+=sim*r
returnsum_ratings/sum_sim
30
TYPICALEVALUATIONPROTOCOL
Hidesomeofthe (theybecome )
Usetheremaining topredictthe
RMSE=
Theaverageerror,wherebigerrorsareheavilypenalized(squared)
32
TYPICALEVALUATIONPROTOCOL
Hidesomeofthe (theybecome )
Usetheremaining topredictthe
RMSE=
Theaverageerror,wherebigerrorsareheavilypenalized(squared)Dothatmanytimeswithdifferentsubsets:thisiscross-validation
32
CROSSVALIDATIONdata=Dataset.load_builtin('ml-100k')#downloaddatasetdata.split(n_folds=3)#3-foldscrossvalidation
algo=SVD()
fortrainset,testsetindata.folds():algo.train(trainset)#fitdatapredictions=algo.test(testset)#predictratingsaccuracy.rmse(predictions,verbose=True)#printRMSE
33
GRID-SEARCHparam_grid={'n_epochs':[5,10],'lr_all':[0.002,0.005],'reg_all':[0.4,0.6]}grid_search=GridSearch(SVD,param_grid,measures=['RMSE','FCP'])
data=Dataset.load_builtin('ml-100k')data.split(n_folds=3)
grid_search.evaluate(data)#willevaluateeachcombination
print(grid_search.best_score['RMSE'])print(grid_search.best_params['RMSE'])algo=grid_search.best_estimator['RMSE'])
34
EVALUATIONTOOLS
CanalsocomputeRecall,Precisionandtop-Nrelatedmeasureseasily
fortrainset,testsetindata.folds():algo.train(trainset)#fitdatapredictions=algo.test(testset)#predictratings
accuracy.rmse(predictions)accuracy.mae(predictions)accuracy.fcp(predictions)
36
MODELPERSISTENCE
Canalsodumpthepredictionstoanalyzeandcarefullycomparealgorithmswithpandas
#...gridsearchalgo=grid_search.best_estimator['RMSE'])
#dumpalgorithmfile_name=os.path.expanduser('~/dump_file')dump.dump(file_name,algo=algo)
#...#ClosePython,gotobed#...
#Wakeup,reloadalgorithm,profit_,algo=dump.load(file_name)
38
OUTLINE1. THENEIGHBORHOODMETHOD(K-NN)2. OVERVIEWOF
( )3. MATRIXFACTORIZATION
SURPRISEhttp://surpriselib.com
39
MATRIXFACTORIZATIONAlotofhypeduringtheNetflixPrize(2006-2009:improveoursystem,getrich)ModeltheratingsinaninsightfulwayTakesitsrootindimensionalreductionandSVD
41
PCAalsogivesyouthe .
Inadvance:youdon'tneedallthe400guys
BEFORESVD:PCAPCAwillreveal400ofthosecreepytypicalguys:
Theseguyscanbuildbackalloftheoriginalfaces
43
PCAONARATINGMATRIX?SURE!Assumeallratingsareknown
Exactsamething!Wejusthaveratingsinsteadofpixels.
PCAwillrevealtypicalusers.
44
PCAONARATINGMATRIX?SURE!Assumeallratingsareknown
Exactsamething!Wejusthaveratingsinsteadofpixels.
PCAwillrevealtypicalusers.
44
PCAONARATINGMATRIX?SURE!Assumeallratingsareknown.Transposethematrix
Exactsamething!PCAwillrevealtypicalmovies.
45
PCAONARATINGMATRIX?SURE!Assumeallratingsareknown.Transposethematrix
Exactsamething!PCAwillrevealtypicalmovies.
Note:inpractice,thefactorssemanticisnotclearlydefined.
45
SVDISPCA2
PCAon givesyouthetypicalusersPCAon givesyouthetypicalmoviesSVDgivesyoubothinoneshot!
isdiagonal,it'sjustascaler.
Thisisourmatrixfactorization!
46
SOHOWTOCOMPUTE AND ?
Columnsof aretheeigenvectorsofColumnsof aretheeigenvectorsofAssociatedeigenvaluesmakeupthediagonalofThereareveryefficientalgorithmsinthewild
48
SOHOWTOCOMPUTE AND ?
Columnsof aretheeigenvectorsofColumnsof aretheeigenvectorsofAssociatedeigenvaluesmakeupthediagonalofThereareveryefficientalgorithmsinthewild
Alternateoption:findthe sandthe sthatminimizethereconstructionerror
(Withsomeorthogonalityconstraints).Therearealsoveryefficientalgorithmsinthewild
48
SOHOWTOCOMPUTE AND ?
Columnsof aretheeigenvectorsofColumnsof aretheeigenvectorsofAssociatedeigenvaluesmakeupthediagonalofThereareveryefficientalgorithmsinthewild
Alternateoption:findthe sandthe sthatminimizethereconstructionerror
(Withsomeorthogonalityconstraints).Therearealsoveryefficientalgorithmsinthewild
So,pieceofcakeright?
48
WEASSUMEDTHAT WASDENSEButit'snot!99%aremissing
and arenotevendefinedSo and arenotdefinedeitherThereisnoSVD
50
WEASSUMEDTHAT WASDENSEButit'snot!99%aremissing
and arenotevendefinedSo and arenotdefinedeitherThereisnoSVD
Twooptions:Fillthemissingentrieswithasimpleheuristic"Let'sjustnotgiveadamn"--SimonFunk(ended-uptop-3oftheNetflixPrizeforsometime)
50
Densecase:findthe sandthe sthatminimizethetotalreconstructionerror
(Withorthogonalityconstraints)
Sparsecase:findthe sandthe sthatminimizethepartialreconstructionerror
(Forgetaboutorthogonality)
APPROXIMATION
51
Densecase:findthe sandthe sthatminimizethetotalreconstructionerror
(Withorthogonalityconstraints)
Sparsecase:findthe sandthe sthatminimizethepartialreconstructionerror
(Forgetaboutorthogonality)
APPROXIMATION
Basicallythesame,exceptwedon'thavealltheratings(andwedon'tcare)
51
Wewantthevalue suchthat
isminimal:
ComputeRandomlyinitialize
(dothisuntilyou'refedup)
MINIMIZATIONBYGRADIENTDESCENT
52
MINIMIZATIONBY(STOCHASTIC)GRADIENTDESCENT
Ourparameter is forall and .Thefunction tobeminimizedis:
defcompute_SVD():'''FitpuandqitoknownratingsbySGD'''p=np.random.normal(size=F)q=np.random.normal(size=F)foriterinrange(n_max_iter):foru,i,r_uiinrating_hist:err=r_ui-np.dot(p[u],q[i])p[u]=p[u]+learning_rate*err*q[i]q[i]=q[i]+learning_rate*err*p[u]
defestimate_rating(u,i):returnnp.dot(p[u],q[i])
53
SOMELASTDETAILSUnbiastheratings,addregularization:youget"SVD":
Goodol'fashionedMLparadigm:Assumeamodelforyourdata( )FityourmodelparameterstotheobserveddataPraisetheLordandhopeforthebest
54
AFEWREMARKSismostlyatoolforalgorithmsthatpredict
ratings.RSdomuchmorethanthat.Focusonexplicitratings(implicitratingsalgorithmswillcome,hopefully)NotaimedtobeacompletetoolforbuildingRSfromAtoZ(checkout )Notaimedtobethemostefficientimplementation(checkout )
Surprise
Mangaki
LightFM
55
SOMEREFERENCESCan'trecommendenough(punintended)Aggarwal'sRecommenderSystems-TheTextbookJeremyKun's (greatinsightson and )blog PCA SVD
56