collaborative filtering for recommendation systems in python, nicolas hug

96
An introduction to COLLABORATIVE FILTERING IN PYTHON and an overview of Surprise 1

Upload: pole-systematic-paris-region

Post on 21-Jan-2018

2.234 views

Category:

Technology


1 download

TRANSCRIPT

Anintroductionto

COLLABORATIVEFILTERINGINPYTHON

andanoverviewofSurprise

1

WHATSHOULDIREAD?

2

WHATSHOULDISEE?

3

WHERESHOULDIEAT?

4

TAXONOMYContent-basedLeverageinformationabouttheitemscontentBasedonitemprofiles(metadata).

CollaborativefilteringLeveragesocialinformation(user-iteminteractions)E.g.:recommenditemslikedbymypeers.Usuallyyieldsbetterresults.

Knowledge-basedLeverageusersrequirementsUsedforcars,loans,realestate...Verytask-specific.

5

CollaborativefilteringLeveragesocialinformation(user-iteminteractions)E.g.:recommenditemslikedbymypeers.Usuallyyieldsbetterresults.

5

RATINGPREDICTION

Rowsareusers,columnsareitems~99%oftheentriesaremissingYourmission:predictthe

6

ABOUTMENicolasHugPhDStudent(UniversityofToulouseIII)NeededaPythonlibraryforRSresearch...Didit.

7

OUTLINE1. THENEIGHBORHOODMETHOD(K-NN)

8

OUTLINE1. THENEIGHBORHOODMETHOD(K-NN)2. OVERVIEWOF

( )

SURPRISEhttp://surpriselib.com

8

OUTLINE1. THENEIGHBORHOODMETHOD(K-NN)2. OVERVIEWOF

( )

SURPRISEhttp://surpriselib.com

3. MATRIXFACTORIZATION

8

1. THENEIGHBORHOODMETHOD(K-NN)

8

NEIGHBORHOODMETHODS

9

NEIGHBORHOODMETHODWehaveahistoryofpastratingsWeneedtopredictAlice'sratingforTitanic

1. FindtheusersthathavethesametastesasAlice(usingtheratinghistory)

2. AveragetheirratingsforTitanic

That'sit10

1. FindtheusersthathavethesametastesasAlice(usingtheratinghistory)

10

WENEEDASIMILARITYMEASURE

11

WENEEDASIMILARITYMEASURE

numberofcommonrateditems

11

WENEEDASIMILARITYMEASURE

numberofcommonrateditemsaverageabsolutedifferencebetweenratings(it'sactuallya

distance)

11

WENEEDASIMILARITYMEASURE

numberofcommonrateditemsaverageabsolutedifferencebetweenratings(it'sactuallya

distance)

cosineanglebetween and

11

WENEEDASIMILARITYMEASURE

numberofcommonrateditemsaverageabsolutedifferencebetweenratings(it'sactuallya

distance)

cosineanglebetween andPearsoncorrelationcoefficientbetween and

11

WENEEDASIMILARITYMEASURE

numberofcommonrateditemsaverageabsolutedifferencebetweenratings(it'sactuallya

distance)

cosineanglebetween andPearsoncorrelationcoefficientbetween and

About3millionsothermeasures11

NEIGHBORHOODMETHODWehaveahistoryofpastratingsWeneedtopredictAlice'sratingforTitanic

1. FindtheusersthathavethesametastesasAlice(usingtheratinghistory)

2. AveragetheirratingsforTitanic

That'sit12

2. AveragetheirratingsforTitanic

12

AGGREGATINGTHENEIGHBORS'RATINGS

AliceisclosetoBob.HerratingforTitanicisprobablyclosetoBob's(1).So

13

AGGREGATINGTHENEIGHBORS'RATINGS

AliceisclosetoBob.HerratingforTitanicisprobablyclosetoBob's(1).SoButweshouldofcourselookatmorethanoneneighbor!

13

LET'SCODE!defpredict_rating(u,i):'''Returnestimatedratingofuseruforitemi.rating_histisalistoftuples(user,item,rating)'''

#Retrieveusershavingratedineighbors=[(sim[u,v],r_vj)for(v,j,r_vj)inrating_histif(i==j)]#Sortthembysimilaritywithuneighbors.sort(key=lambdatple:tple[0],reversed=True)#Computeweightedaverageofthek-NN'sratingsnum=sum(sim_uv*r_vifor(sim_uv,r_vi)inneighbors[:k])denum=sum(sim_uvfor(sim_uv,_)inneighbors[:k])

returnnum/denum

14

LET'SCODE!defpredict_rating(u,i):'''Returnestimatedratingofuseruforitemi.rating_histisalistoftuples(user,item,rating)'''

#Retrieveusershavingratedineighbors=[(sim[u,v],r_vj)for(v,j,r_vj)inrating_histif(i==j)]#Sortthembysimilaritywithuneighbors.sort(key=lambdatple:tple[0],reversed=True)#Computeweightedaverageofthek-NN'sratingsnum=sum(sim_uv*r_vifor(sim_uv,r_vi)inneighbors[:k])denum=sum(sim_uvfor(sim_uv,_)inneighbors[:k])

returnnum/denum

14

NEIGHBORHOODMETHODWehaveahistoryofpastratingsWeneedtopredictAlice'sratingforTitanic

1. FindtheusersthathavethesametastesasAlice(usingtheratinghistory)

2. AveragetheirratingsforTitanic

That'sit... 15

NEIGHBORHOODMETHODWehaveahistoryofpastratingsWeneedtopredictAlice'sratingforTitanic

1. FindtheusersthathavethesametastesasAlice(usingtheratinghistory)

2. AveragetheirratingsforTitanic

That'sit...?15

THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS

Normalizetheratings

16

THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS

NormalizetheratingsRemovebias(someusersaremean)

16

THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS

NormalizetheratingsRemovebias(someusersaremean)Useafancieraggregation

16

THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS

NormalizetheratingsRemovebias(someusersaremean)UseafancieraggregationDiscountsimilarities(givethemmoreorlessconfidence)

16

THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS

NormalizetheratingsRemovebias(someusersaremean)UseafancieraggregationDiscountsimilarities(givethemmoreorlessconfidence)Useitem-itemsimilarityinstead

16

THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS

NormalizetheratingsRemovebias(someusersaremean)UseafancieraggregationDiscountsimilarities(givethemmoreorlessconfidence)Useitem-itemsimilarityinsteadOrusebothkindsofsimilarities!

16

THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS

NormalizetheratingsRemovebias(someusersaremean)UseafancieraggregationDiscountsimilarities(givethemmoreorlessconfidence)Useitem-itemsimilarityinsteadOrusebothkindsofsimilarities!Clusterusersand/oritems

16

THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS

NormalizetheratingsRemovebias(someusersaremean)UseafancieraggregationDiscountsimilarities(givethemmoreorlessconfidence)Useitem-itemsimilarityinsteadOrusebothkindsofsimilarities!Clusterusersand/oritemsLearnthesimilarities

16

THEREARE(APPROXIMATELY)HALFABILLIONVARIANTS

NormalizetheratingsRemovebias(someusersaremean)UseafancieraggregationDiscountsimilarities(givethemmoreorlessconfidence)Useitem-itemsimilarityinsteadOrusebothkindsofsimilarities!Clusterusersand/oritemsLearnthesimilaritiesBlahblahblah...

16

OUTLINE1. THENEIGHBORHOODMETHOD(K-NN)2. OVERVIEWOF

( )3. MATRIXFACTORIZATION

SURPRISEhttp://surpriselib.com

17

2. OVERVIEWOF( )

SURPRISEhttp://surpriselib.com

17

SURPRISEAPythonlibraryforrecommendersystems

(Orrather:aPythonlibraryforratingpredictionalgorithms)

18

WHY?NeededaPythonlibforquickandeasyprototypingNeededtocontrolmyexperiments

19

SOWHYNOTSCIKIT-LEARN?

20

SOWHYNOTSCIKIT-LEARN?

Ratingprediction≠regressionorclassification

20

YES:)

NO:(

clf=MyClassifier()clf.fit(X_train,y_train)clf.predict(X_test)

rec=MyRecommender()rec.fit(X_train,y_train)rec.predict(X_test)

21

BASICUSAGEfromsurpriseimportKNNBasicfromsurpriseimportDataset

data=Dataset.load_builtin('ml-100k')#downloaddatasettrainset=data.build_full_trainset()#buildtrainset

algo=KNNBasic()#usebuilt-inpredictionalgorithm

algo.train(trainset)#fitdata

algo.predict('Alice','Titanic')algo.predict('Bob','ToyStory')#...

22

MAINFEATURESEasydatasethandling

23

DATASETHANDLING#Built-indatasets(movielens,Jester)data=Dataset.load_builtin('ml-100k')

#Customdatasets(fromfile)file_path=os.path.expanduser('./my_data')reader=Reader(line_format='useritemratingtimestamp',sep='\t')data=Dataset.load_from_file(file_path,reader=reader)

#Customdatasets(frompandasdataframe)reader=Reader(rating_scale=(1,5))data=Dataset.load_from_df(df[['uid','iid','r_ui']],reader)

24

MAINFEATURES

Built-inpredictionalgorithms(SVD,k-NN,manyothers)Built-insimilaritymeasures(Cosine,Pearson...)

25

BUILT-INALGORITHMSANDSIMILARITIES

SVD,SVD++,NMF, -NN(withvariants),SlopeOne,somebaselines...Cosine,Pearson,MSD(betweenusersoritems)sim_options={'name':'cosine','user_based':False,'min_support':10}algo=KNNBasic(sim_options=sim_options)

26

MAINFEATURES

Customalgorithmsareeasytoimplement

27

CUSTOMPREDICTIONALGORITHMclassMyRatingPredictor(AlgoBase):

defestimate(self,u,i):return3

algo=MyRatingPredictor()algo.train(trainset)pred=algo.predict('Alice','Titanic')#willcallestimate

28

CUSTOMPREDICTIONALGORITHMclassMyRatingPredictor(AlgoBase):

deftrain(self,trainset):'''Fityourdatahere.'''

self.mean=np.mean([r_uifor(u,i,r_ui)intrainset.all_ratings()])

defestimate(self,u,i):returnself.mean

29

CUSTOMPREDICTIONALGORITHMclassMyRatingPredictor(AlgoBase):

deftrain(self,trainset):'''Fityourdatahere.'''

self.mean=np.mean([r_uifor(u,i,r_ui)intrainset.all_ratings()])

defestimate(self,u,i):returnself.mean

trainsetobject:Iteratorsovertheratinghistory(user-wise,item-wise,orunordered)IteratorsoverusersanditemsidsOtherusefulmethods/attributes

29

classMyKNN(AlgoBase):

deftrain(self,trainset):self.trainset=trainsetself.sim=...#Computesimilaritymatrix

defestimate(self,u,i):

neighbors=[(self.sim[u,v],r_vi)for(v,r_vj)inself.trainset.ur[u]]neighbors=sorted(neighbors,key=lambdatple:tple[0],reverse=True)

sum_sim=sum_ratings=0for(sim_uv,r_vi)inneighbors[:self.k]:sum_sim+=simsum_ratings+=sim*r

returnsum_ratings/sum_sim

30

MAINFEATURES

Cross-validation,grid-search

31

TYPICALEVALUATIONPROTOCOL

Hidesomeofthe (theybecome )

Usetheremaining topredictthe

32

TYPICALEVALUATIONPROTOCOL

Hidesomeofthe (theybecome )

Usetheremaining topredictthe

RMSE=

Theaverageerror,wherebigerrorsareheavilypenalized(squared)

32

TYPICALEVALUATIONPROTOCOL

Hidesomeofthe (theybecome )

Usetheremaining topredictthe

RMSE=

Theaverageerror,wherebigerrorsareheavilypenalized(squared)Dothatmanytimeswithdifferentsubsets:thisiscross-validation

32

CROSSVALIDATIONdata=Dataset.load_builtin('ml-100k')#downloaddatasetdata.split(n_folds=3)#3-foldscrossvalidation

algo=SVD()

fortrainset,testsetindata.folds():algo.train(trainset)#fitdatapredictions=algo.test(testset)#predictratingsaccuracy.rmse(predictions,verbose=True)#printRMSE

33

GRID-SEARCHparam_grid={'n_epochs':[5,10],'lr_all':[0.002,0.005],'reg_all':[0.4,0.6]}grid_search=GridSearch(SVD,param_grid,measures=['RMSE','FCP'])

data=Dataset.load_builtin('ml-100k')data.split(n_folds=3)

grid_search.evaluate(data)#willevaluateeachcombination

print(grid_search.best_score['RMSE'])print(grid_search.best_params['RMSE'])algo=grid_search.best_estimator['RMSE'])

34

MAINFEATURES

Built-inevaluationmeasures(RMSE,MAE...)

35

EVALUATIONTOOLS

CanalsocomputeRecall,Precisionandtop-Nrelatedmeasureseasily

fortrainset,testsetindata.folds():algo.train(trainset)#fitdatapredictions=algo.test(testset)#predictratings

accuracy.rmse(predictions)accuracy.mae(predictions)accuracy.fcp(predictions)

36

MAINFEATURES

Modelpersistence

37

MODELPERSISTENCE

Canalsodumpthepredictionstoanalyzeandcarefullycomparealgorithmswithpandas

#...gridsearchalgo=grid_search.best_estimator['RMSE'])

#dumpalgorithmfile_name=os.path.expanduser('~/dump_file')dump.dump(file_name,algo=algo)

#...#ClosePython,gotobed#...

#Wakeup,reloadalgorithm,profit_,algo=dump.load(file_name)

38

OUTLINE1. THENEIGHBORHOODMETHOD(K-NN)2. OVERVIEWOF

( )3. MATRIXFACTORIZATION

SURPRISEhttp://surpriselib.com

39

3. MATRIXFACTORIZATION

39

MATRIXFACTORIZATION

40

MATRIXFACTORIZATIONAlotofhypeduringtheNetflixPrize(2006-2009:improveoursystem,getrich)ModeltheratingsinaninsightfulwayTakesitsrootindimensionalreductionandSVD

41

BEFORESVD:PCAHereare400greyscaleimages(64x64):

Putthemina400x4096matrix :

42

PCAalsogivesyouthe .

Inadvance:youdon'tneedallthe400guys

BEFORESVD:PCAPCAwillreveal400ofthosecreepytypicalguys:

Theseguyscanbuildbackalloftheoriginalfaces

43

PCAONARATINGMATRIX?SURE!Assumeallratingsareknown

Exactsamething!Wejusthaveratingsinsteadofpixels.

PCAwillrevealtypicalusers.

44

PCAONARATINGMATRIX?SURE!Assumeallratingsareknown

Exactsamething!Wejusthaveratingsinsteadofpixels.

PCAwillrevealtypicalusers.

44

PCAONARATINGMATRIX?SURE!Assumeallratingsareknown.Transposethematrix

Exactsamething!PCAwillrevealtypicalmovies.

45

PCAONARATINGMATRIX?SURE!Assumeallratingsareknown.Transposethematrix

Exactsamething!PCAwillrevealtypicalmovies.

Note:inpractice,thefactorssemanticisnotclearlydefined.

45

SVDISPCA2

PCAon givesyouthetypicalusersPCAon givesyouthetypicalmoviesSVDgivesyoubothinoneshot!

46

SVDISPCA2

PCAon givesyouthetypicalusersPCAon givesyouthetypicalmoviesSVDgivesyoubothinoneshot!

isdiagonal,it'sjustascaler.

Thisisourmatrixfactorization!

46

THEMODELOFSVD

47

THEMODELOFSVD

47

THEMODELOFSVD

47

THEMODELOFSVD

Rating(Alice,Titanic)willbehighRating(Bob,Titanic)willbelow

47

SOHOWTOCOMPUTE AND ?

Columnsof aretheeigenvectorsofColumnsof aretheeigenvectorsofAssociatedeigenvaluesmakeupthediagonalofThereareveryefficientalgorithmsinthewild

48

SOHOWTOCOMPUTE AND ?

Columnsof aretheeigenvectorsofColumnsof aretheeigenvectorsofAssociatedeigenvaluesmakeupthediagonalofThereareveryefficientalgorithmsinthewild

Alternateoption:findthe sandthe sthatminimizethereconstructionerror

(Withsomeorthogonalityconstraints).Therearealsoveryefficientalgorithmsinthewild

48

SOHOWTOCOMPUTE AND ?

Columnsof aretheeigenvectorsofColumnsof aretheeigenvectorsofAssociatedeigenvaluesmakeupthediagonalofThereareveryefficientalgorithmsinthewild

Alternateoption:findthe sandthe sthatminimizethereconstructionerror

(Withsomeorthogonalityconstraints).Therearealsoveryefficientalgorithmsinthewild

So,pieceofcakeright?

48

OOPS!

49

WEASSUMEDTHAT WASDENSEButit'snot!99%aremissing

and arenotevendefinedSo and arenotdefinedeitherThereisnoSVD

50

WEASSUMEDTHAT WASDENSEButit'snot!99%aremissing

and arenotevendefinedSo and arenotdefinedeitherThereisnoSVD

Twooptions:Fillthemissingentrieswithasimpleheuristic"Let'sjustnotgiveadamn"--SimonFunk(ended-uptop-3oftheNetflixPrizeforsometime)

50

Densecase:findthe sandthe sthatminimizethetotalreconstructionerror

(Withorthogonalityconstraints)

Sparsecase:findthe sandthe sthatminimizethepartialreconstructionerror

(Forgetaboutorthogonality)

APPROXIMATION

51

Densecase:findthe sandthe sthatminimizethetotalreconstructionerror

(Withorthogonalityconstraints)

Sparsecase:findthe sandthe sthatminimizethepartialreconstructionerror

(Forgetaboutorthogonality)

APPROXIMATION

Basicallythesame,exceptwedon'thavealltheratings(andwedon'tcare)

51

Wewantthevalue suchthat

isminimal:

ComputeRandomlyinitialize

(dothisuntilyou'refedup)

MINIMIZATIONBYGRADIENTDESCENT

52

MINIMIZATIONBY(STOCHASTIC)GRADIENTDESCENT

Ourparameter is forall and .Thefunction tobeminimizedis:

defcompute_SVD():'''FitpuandqitoknownratingsbySGD'''p=np.random.normal(size=F)q=np.random.normal(size=F)foriterinrange(n_max_iter):foru,i,r_uiinrating_hist:err=r_ui-np.dot(p[u],q[i])p[u]=p[u]+learning_rate*err*q[i]q[i]=q[i]+learning_rate*err*p[u]

defestimate_rating(u,i):returnnp.dot(p[u],q[i])

53

SOMELASTDETAILSUnbiastheratings,addregularization:youget"SVD":

54

SOMELASTDETAILSUnbiastheratings,addregularization:youget"SVD":

Goodol'fashionedMLparadigm:Assumeamodelforyourdata( )FityourmodelparameterstotheobserveddataPraisetheLordandhopeforthebest

54

AFEWREMARKSismostlyatoolforalgorithmsthatpredict

ratings.RSdomuchmorethanthat.Focusonexplicitratings(implicitratingsalgorithmswillcome,hopefully)NotaimedtobeacompletetoolforbuildingRSfromAtoZ(checkout )Notaimedtobethemostefficientimplementation(checkout )

Surprise

Mangaki

LightFM

55

SOMEREFERENCESCan'trecommendenough(punintended)Aggarwal'sRecommenderSystems-TheTextbookJeremyKun's (greatinsightson and )blog PCA SVD

56

THANKS!

57