CS 5614: (Big) Data Management Systems
B. Aditya Prakash
Lecture #16: Recommendation Systems
Example: Recommender Systems
§ Customer X
– Buys Metallica CD
– Buys Megadeth CD
§ Customer Y
– Does search on Metallica
– Recommender system suggests Megadeth from data collected about customer X
VT CS 5614, Prakash 2017
Recommendations
[Figure: users reach items through Search and through Recommendations.]
Examples of items: products, web sites, blogs, news items, …
From Scarcity to Abundance
§ Shelf space is a scarce commodity for traditional retailers
– Also: TV networks, movie theaters, …
§ Web enables near-zero-cost dissemination of information about products
– From scarcity to abundance
§ More choice necessitates better filters
– Recommendation engines
– How Into Thin Air made Touching the Void a bestseller: http://www.wired.com/wired/archive/12.10/tail.html
Sidenote: The Long Tail
[Figure: the long tail. Source: Chris Anderson (2004)]
Physical vs. Online
Read http://www.wired.com/wired/archive/12.10/tail.html to learn more!
Types of Recommendations
§ Editorial and hand curated
– List of favorites
– Lists of “essential” items
§ Simple aggregates
– Top 10, Most Popular, Recent Uploads
§ Tailored to individual users
– Amazon, Netflix, …
Formal Model
§ X = set of Customers
§ S = set of Items
§ Utility function u: X × S → R
– R = set of ratings
– R is a totally ordered set
– e.g., 0-5 stars, real number in [0, 1]
Utility Matrix
[Table: sparse utility matrix with rows Alice, Bob, Carol, David and columns Avatar, LOTR, Matrix, Pirates. Known ratings: 0.4, 1, 0.2, 0.3, 0.5, 0.2, 1; all other cells are empty.]
Key Problems
§ (1) Gathering “known” ratings for matrix
– How to collect the data in the utility matrix
§ (2) Extrapolate unknown ratings from the known ones
– Mainly interested in high unknown ratings
• We are not interested in knowing what you don’t like but what you like
§ (3) Evaluating extrapolation methods
– How to measure success/performance of recommendation methods
(1) Gathering Ratings
§ Explicit
– Ask people to rate items
– Doesn’t work well in practice: people can’t be bothered
§ Implicit
– Learn ratings from user actions
• E.g., purchase implies high rating
– What about low ratings?
(2) Extrapolating Utilities
§ Key problem: Utility matrix U is sparse
– Most people have not rated most items
– Cold start:
• New items have no ratings
• New users have no history
§ Three approaches to recommender systems:
– 1) Content-based
– 2) Collaborative
– 3) Latent factor based (only briefly)
CONTENT-BASED RECOMMENDER SYSTEMS
Content-based Recommendations
§ Main idea: Recommend items to customer x similar to previous items rated highly by x
Example:
§ Movie recommendations
– Recommend movies with same actor(s), director, genre, …
§ Websites, blogs, news
– Recommend other sites with “similar” content
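Plan of Action
[Figure: the user likes some items (red circles, triangles). Build item profiles, build a user profile from the liked items, match the user profile against item profiles, and recommend the matches.]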
Item Profiles
§ For each item, create an item profile
§ Profile is a set (vector) of features
– Movies: author, title, actor, director, …
– Text: Set of “important” words in document
§ How to pick important features?
– Usual heuristic from text mining is TF-IDF (Term frequency * Inverse Doc Frequency)
• Term … Feature
• Document … Item
Sidenote: TF-IDF
$f_{ij}$ = frequency of term (feature) i in doc (item) j
$TF_{ij} = f_{ij} / \max_k f_{kj}$
Note: we normalize TF to discount for “longer” documents
$n_i$ = number of docs that mention term i
N = total number of docs
$IDF_i = \log(N / n_i)$
TF-IDF score: $w_{ij} = TF_{ij} \times IDF_i$
Doc profile = set of words with highest TF-IDF scores, together with their scores
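A minimal sketch of the TF-IDF score, assuming TF is normalized by the most frequent term in the document and IDF uses a base-2 logarithm (the base is an assumption; it only rescales the scores). The three toy documents and the function name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: w_ij} dict per document (documents must be non-empty)."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # n_i: number of documents that mention term i
    doc_freq = Counter(term for c in counts for term in c)
    profiles = []
    for c in counts:
        max_f = max(c.values())  # normalizer that discounts longer documents
        profiles.append({
            term: (f / max_f) * math.log2(n_docs / doc_freq[term])
            for term, f in c.items()
        })
    return profiles

# Hypothetical toy corpus: each document is a list of terms.
docs = [
    ["space", "alien", "space", "war"],
    ["war", "history", "drama"],
    ["alien", "comedy", "alien"],
]
profiles = tf_idf(docs)
# "space" is frequent in doc 0 and rare in the corpus, so it scores highest there.
top_term = max(profiles[0], key=profiles[0].get)
```

A doc profile would then keep only the highest-scoring terms of each dictionary.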
User Profiles and Prediction
§ User profile possibilities:
– Weighted average of rated item profiles
– Variation: weight by difference from average rating for item
– …
§ Prediction heuristic:
– Given user profile x and item profile i, estimate $u(x, i) = \cos(x, i) = \frac{x \cdot i}{\|x\| \, \|i\|}$
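A small sketch of the prediction heuristic, assuming the user profile is the rating-weighted average of the profiles of rated items. The three features, the ratings, and the candidate items are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def user_profile(item_profiles, ratings):
    """Rating-weighted average of the profiles of the items the user rated."""
    total = sum(ratings)
    dim = len(item_profiles[0])
    return [
        sum(r * p[d] for r, p in zip(ratings, item_profiles)) / total
        for d in range(dim)
    ]

# Hypothetical features: [action, comedy, sci-fi]
rated_items = [[1, 0, 1], [0, 1, 0]]
ratings = [5, 1]
x = user_profile(rated_items, ratings)

candidates = {"space_action": [1, 0, 1], "romcom": [0, 1, 0]}
scores = {name: cosine(x, profile) for name, profile in candidates.items()}
best = max(scores, key=scores.get)
```

The item whose profile points in the direction closest to the user profile is recommended first.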
Pros: Content-based Approach
§ +: No need for data on other users
– No cold-start or sparsity problems
§ +: Able to recommend to users with unique tastes
§ +: Able to recommend new & unpopular items
– No first-rater problem
§ +: Able to provide explanations
– Can provide explanations of recommended items by listing content-features that caused an item to be recommended
Cons: Content-based Approach
§ –: Finding the appropriate features is hard
– E.g., images, movies, music
§ –: Recommendations for new users
– How to build a user profile?
§ –: Overspecialization
– Never recommends items outside user’s content profile
– People might have multiple interests
– Unable to exploit quality judgments of other users
COLLABORATIVE FILTERING
Harnessing quality judgments of other users
Collaborative Filtering
§ Consider user x
§ Find set N of other users whose ratings are “similar” to x’s ratings
§ Estimate x’s ratings based on ratings of users in N
[Figure: user x and the set N of similar users.]
Finding “Similar” Users
§ Let $r_x$ be the vector of user x’s ratings
– Example: $r_x$ = [*, _, _, *, ***], $r_y$ = [*, _, **, **, _]
§ Jaccard similarity measure
– $r_x$, $r_y$ as sets: $r_x$ = {1, 4, 5}, $r_y$ = {1, 3, 4}
– Problem: Ignores the value of the rating
§ Cosine similarity measure
– $sim(x, y) = \cos(r_x, r_y)$
– $r_x$, $r_y$ as points: $r_x$ = (1, 0, 0, 1, 3), $r_y$ = (1, 0, 2, 2, 0)
– Problem: Treats missing ratings as “negative”
§ Pearson correlation coefficient
– $S_{xy}$ = items rated by both users x and y
– $\bar{r}_x$, $\bar{r}_y$ … avg. rating of x, y
$$sim(x, y) = \frac{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)(r_{ys} - \bar{r}_y)}{\sqrt{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)^2} \, \sqrt{\sum_{s \in S_{xy}} (r_{ys} - \bar{r}_y)^2}}$$
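The three measures can be compared on the slide's own vectors. This is a sketch that assumes 0 encodes a missing rating and that each user's average is taken over all items that user rated; the function names are illustrative.

```python
import math

# Ratings from the slide; 0 marks a missing rating.
rx = [1, 0, 0, 1, 3]
ry = [1, 0, 2, 2, 0]

def jaccard(a, b):
    """Overlap of the sets of rated items; rating values are ignored."""
    sa = {i for i, r in enumerate(a) if r}
    sb = {i for i, r in enumerate(b) if r}
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    """Cosine over the full vectors; missing ratings count as 0."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

def pearson(a, b):
    """Pearson correlation over the items rated by both users."""
    mean_a = sum(a) / sum(1 for r in a if r)
    mean_b = sum(b) / sum(1 for r in b if r)
    common = [i for i in range(len(a)) if a[i] and b[i]]
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = (math.sqrt(sum((a[i] - mean_a) ** 2 for i in common))
           * math.sqrt(sum((b[i] - mean_b) ** 2 for i in common)))
    return num / den if den else 0.0

sims = {"jaccard": jaccard(rx, ry), "cosine": cosine(rx, ry), "pearson": pearson(rx, ry)}
```

Jaccard only sees that two of the four rated items overlap, while cosine and Pearson also use the rating values.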
Similarity Metric
[Table: ratings of users A, B and C.]
§ Intuitively we want: sim(A, B) > sim(A, C)
§ Jaccard similarity: 1/5 < 2/4
§ Cosine similarity: 0.386 > 0.322
– Considers missing ratings as “negative”
– Solution: subtract the (row) mean
§ After subtracting the row mean, sim(A, B) vs. sim(A, C): 0.092 > -0.559
§ Notice cosine sim. is correlation when data is centered at 0
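Rating Predictions
§ Prediction for item i of user x, from the ratings of the k = |N| similar users y ∈ N:
$$r_{xi} = \frac{1}{k} \sum_{y \in N} r_{yi}$$
$$r_{xi} = \frac{\sum_{y \in N} s_{xy} \, r_{yi}}{\sum_{y \in N} s_{xy}}$$
where $s_{xy} = sim(x, y)$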
Item-Item Collaborative Filtering
§ So far: User-user collaborative filtering
§ Another view: Item-item
– For item i, find other similar items
– Estimate rating for item i based on ratings for similar items
– Can use same similarity metrics and prediction functions as in user-user model
$$r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \, r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$$
$s_{ij}$ … similarity of items i and j
$r_{xj}$ … rating of user x on item j
N(i;x) … set of items rated by x similar to i
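Item-Item CF (|N| = 2)
[Table: utility matrix, 6 movies (rows) × 12 users (columns). A filled cell is a rating between 1 and 5, an empty cell is an unknown rating. Known ratings per movie, in increasing user order (empty cells omitted):]
movie 1: 1 3 5 5 4
movie 2: 5 4 4 2 1 3
movie 3: 2 4 1 2 3 4 3 5
movie 4: 2 4 5 4 2
movie 5: 4 3 4 2 2 5
movie 6: 1 3 3 2 4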
Item-Item CF (|N| = 2)
[Table: the same 6 × 12 utility matrix, with the cell (movie 1, user 5) marked “?”.]
Goal: estimate rating of movie 1 by user 5
Item-Item CF (|N| = 2)
Neighbor selection: Identify movies similar to movie 1, rated by user 5
[Table: the utility matrix with an added column sim(1, m):]
movie m: 1, 2, 3, 4, 5, 6
sim(1, m): 1.00, -0.18, 0.41, -0.10, -0.31, 0.59
Here we use Pearson correlation as similarity:
1) Subtract mean rating $m_i$ from each movie i
$m_1$ = (1+3+5+5+4)/5 = 3.6
row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows
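Step 1 can be checked for movie 1 with a short sketch. The user positions of movie 1's ratings are read off the centered row above (non-zero entries at users 1, 3, 6, 9, 11); the helper names are illustrative.

```python
import math

# Movie 1: user -> rating, positions taken from the centered row on the slide.
ratings_movie1 = {1: 1, 3: 3, 6: 5, 9: 5, 11: 4}
N_USERS = 12

def center(ratings, n_users):
    """Subtract the movie's mean from known ratings; unknown ratings become 0."""
    mean = sum(ratings.values()) / len(ratings)
    return [ratings[u] - mean if u in ratings else 0.0 for u in range(1, n_users + 1)]

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

row1 = center(ratings_movie1, N_USERS)
row1_rounded = [round(v, 1) for v in row1]
# Step 2 applied to movie 1 against itself gives the 1.00 at the top of sim(1, m).
sim_1_1 = cosine(row1, row1)
```

Applying `cosine` to the centered rows of movie 1 and another movie m would give the remaining sim(1, m) values.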
Item-Item CF (|N| = 2)
[Table: the utility matrix with the sim(1, m) column: 1.00, -0.18, 0.41, -0.10, -0.31, 0.59 for movies 1 to 6.]
Compute similarity weights: $s_{1,3}$ = 0.41, $s_{1,6}$ = 0.59
Item-Item CF (|N| = 2)
[Table: the utility matrix with the cell (movie 1, user 5) filled in with 2.6.]
Predict by taking weighted average:
$r_{1,5}$ = (0.41*2 + 0.59*3) / (0.41 + 0.59) = 2.6
$$r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \, r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$$
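The worked example can be reproduced with a few lines. This sketch selects the |N| = 2 most similar movies among those user 5 rated and takes the similarity-weighted average; only the two ratings of user 5 that the slide uses are included, and the function name is illustrative.

```python
def predict_item_item(sims_to_i, user_ratings, n_neighbors=2):
    """Weighted average over the n most similar items that the user has rated."""
    rated = [(sims_to_i[j], r) for j, r in user_ratings.items() if j in sims_to_i]
    top = sorted(rated, reverse=True)[:n_neighbors]
    return sum(s * r for s, r in top) / sum(s for s, _ in top)

# sim(1, m) for movies 2..6, from the slide.
sim_to_movie1 = {2: -0.18, 3: 0.41, 4: -0.10, 5: -0.31, 6: 0.59}
# User 5's ratings used on the slide: movie 3 -> 2, movie 6 -> 3.
user5 = {3: 2, 6: 3}

r_1_5 = predict_item_item(sim_to_movie1, user5)  # 2.59, shown as 2.6 on the slide
```

Negative or near-zero similarity sums would make the division unstable, so a production version would need to guard against that.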
CF: Common Practice
§ Define similarity $s_{ij}$ of items i and j
§ Select k nearest neighbors N(i; x)
– Items most similar to i, that were rated by x
§ Estimate rating $r_{xi}$ as the weighted average:
Before:
$$r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \, r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$$
Common practice:
$$r_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij} \, (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}}$$
baseline estimate for $r_{xi}$: $b_{xi} = \mu + b_x + b_i$
μ = overall mean movie rating
$b_x$ = rating deviation of user x = (avg. rating of user x) - μ
$b_i$ = rating deviation of movie i
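A minimal sketch of the baseline-corrected estimate. All numbers (the overall mean, the deviations, the two neighbors) are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def baseline(mu, b_user, b_item):
    """b_xi = mu + b_x + b_i"""
    return mu + b_user + b_item

def predict_with_baseline(mu, b_x, b_i, neighbors):
    """neighbors: list of (s_ij, r_xj, b_j) for the items j in N(i; x)."""
    b_xi = baseline(mu, b_x, b_i)
    num = sum(s * (r - baseline(mu, b_x, b_j)) for s, r, b_j in neighbors)
    den = sum(s for s, _, _ in neighbors)
    return b_xi + num / den

# Hypothetical values: overall mean 3.5, a user who rates 0.5 below average,
# a movie rated 0.5 above average, so the baseline b_xi is 3.5.
mu, b_x, b_i = 3.5, -0.5, 0.5
neighbors = [
    (0.5, 4.0, 0.0),  # user rated it 1.0 above that item's baseline of 3.0
    (0.5, 3.0, 0.5),  # user rated it 0.5 below that item's baseline of 3.5
]
r_xi = predict_with_baseline(mu, b_x, b_i, neighbors)  # 3.5 + 0.25 = 3.75
```

The neighbors contribute only their deviations from the baseline, so a generally harsh user or a generally popular movie does not distort the weighted average.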
Item-Item vs. User-User
[Table: ratings of Alice, Bob, Carol and David for Avatar, LOTR, Matrix and Pirates.]
§ In practice, it has been observed that item-item often works better than user-user
§ Why? Items are simpler, users have multiple tastes
Pros/Cons of Collaborative Filtering
§ + Works for any kind of item
– No feature selection needed
§ - Cold Start:
– Need enough users in the system to find a match
§ - Sparsity:
– The user/ratings matrix is sparse
– Hard to find users that have rated the same items
§ - First rater:
– Cannot recommend an item that has not been previously rated
– New items, Esoteric items
§ - Popularity bias:
– Cannot recommend items to someone with unique taste
– Tends to recommend popular items
Hybrid Methods
§ Implement two or more different recommenders and combine predictions
– Perhaps using a linear model
§ Add content-based methods to collaborative filtering
– Item profiles for new item problem
– Demographics to deal with new user problem
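One possible reading of "combine predictions using a linear model" is a fixed-weight blend, sketched below. The weight, the fallback for items without collaborative-filtering data, and the function name are assumptions, not something the slide specifies.

```python
def hybrid_score(content_score, cf_score, w_content=0.5):
    """Linear blend of two predictions.

    cf_score may be None for a new item that nobody has rated yet;
    in that case only the content-based score is available.
    """
    if cf_score is None:
        return content_score
    return w_content * content_score + (1 - w_content) * cf_score

blended = hybrid_score(4.0, 3.0, w_content=0.25)   # 0.25*4.0 + 0.75*3.0
new_item = hybrid_score(4.0, None)                 # falls back to content-based
```

In practice the weight would be fitted on held-out ratings rather than fixed by hand.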
LATENT FACTOR MODELS
Latent Factor Models (e.g., SVD)
[Figure: movies placed in a two-dimensional factor space. Axes: geared towards females vs. geared towards males; serious vs. funny. Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean’s 11, Sense and Sensibility.]
Latent Factor Models
§ “SVD” on Netflix data: $R \approx Q \cdot P^T$
§ For now let’s assume we can approximate the rating matrix R as a product of “thin” $Q \cdot P^T$
– R has missing entries but let’s ignore that for now!
• Basically, we will want the reconstruction error to be small on known ratings and we don’t care about the values on the missing ones
SVD: $A = U \Sigma V^T$
R (items × users) ≈ Q (items × factors) · $P^T$ (factors × users)
[R is the 6 × 12 utility matrix from the item-item example.]
Q (one row per item):
.1 -.4 .2
-.5 .6 .5
-.2 .3 .5
1.1 2.1 .3
-.7 2.1 -2
-1 .7 .3
$P^T$ (one column per user, users 1 to 12):
1.1 -.2 .3 .5 -2 -.5 .8 -.4 .3 1.4 2.4 -.9
-.8 .7 .5 1.4 .3 -1 1.4 2.9 -.7 1.2 -.1 1.3
2.1 -.4 .6 1.7 2.4 .9 -.3 .4 .8 .7 -.6 .1
Ratings as Products of Factors
§ How to estimate the missing rating of user x for item i?
[Figure: $R \approx Q \cdot P^T$, with one missing entry of R marked “?”.]
$q_i$ = row i of Q
$p_x$ = column x of $P^T$
Ratings as Products of Factors
§ How to estimate the missing rating of user x for item i?
$$\hat{r}_{xi} = q_i \cdot p_x = \sum_{t=1}^{f} q_{it} \, p_{xt}$$
$q_i$ = row i of Q
$p_x$ = column x of $P^T$
f = number of factors
[Figure: $R \approx Q \cdot P^T$; the “?” entry of R is filled in with 2.4.]
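The estimate is a dot product of one item row and one user column. The sketch below takes the Q row with entries -.5, .6, .5 and the $P^T$ column with entries -2, .3, 2.4 from the slide's matrices. Which cell the slide marks with “?” is not recoverable from the text, so this pairing is an assumption, chosen because it reproduces the 2.4 shown on the slide.

```python
def predict(q_i, p_x):
    """Estimated rating: dot product of item factors and user factors."""
    return sum(q * p for q, p in zip(q_i, p_x))

q_i = [-0.5, 0.6, 0.5]   # one row of Q (assumed to be the marked item)
p_x = [-2.0, 0.3, 2.4]   # one column of P^T (assumed to be the marked user)

r_hat = predict(q_i, p_x)  # 1.0 + 0.18 + 1.2 = 2.38, shown as 2.4
```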
Latent Factor Models
[Figure: the same movies in the two-dimensional factor space (geared towards females vs. geared towards males; serious vs. funny), with the axes labeled Factor 1 and Factor 2.]
For more details
§ Read the textbook!
REMARKS & PRACTICAL TIPS
- Evaluation
- Error metrics
- Complexity / Speed
Evaluation
[Figure: users × movies matrix of known ratings.]
Evaluation
[Figure: the same users × movies matrix with some known ratings withheld (shown as “?”) to form the Test Data Set.]
Evaluating Predictions
§ Compare predictions with known ratings
– Root-mean-square error (RMSE)
• $\sqrt{\frac{1}{n} \sum_{xi} (r_{xi} - r^*_{xi})^2}$, where $r_{xi}$ is the predicted rating, $r^*_{xi}$ is the true rating of x on i, and n is the number of test ratings
– Precision at top 10:
• % of those in top 10
– Rank Correlation:
• Spearman’s correlation between system’s and user’s complete rankings
§ Another approach: 0/1 model
– Coverage:
• Number of items/users for which system can make predictions
– Precision:
• Accuracy of predictions
– Receiver operating characteristic (ROC)
• Tradeoff curve between false positives and false negatives
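A short sketch of RMSE on a withheld test set. The predicted and true ratings are hypothetical.

```python
import math

def rmse(predicted, actual):
    """Root of the mean squared difference between predicted and true ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical test ratings: errors are 0, -1 and 2.
predicted = [3, 4, 5]
actual = [3, 5, 3]
error = rmse(predicted, actual)  # sqrt((0 + 1 + 4) / 3)
```

Because the errors are squared, the single miss of 2 stars contributes four times as much as the miss of 1 star.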
Problems with Error Measures
§ Narrow focus on accuracy sometimes misses the point
– Prediction Diversity
– Prediction Context
– Order of predictions
§ In practice, we care only to predict high ratings:
– RMSE might penalize a method that does well for high ratings and badly for others
Collaborative Filtering: Complexity
§ Expensive step is finding k most similar customers: O(|X|)
§ Too expensive to do at runtime
– Could pre-compute
§ Naïve pre-computation takes time O(k · |X|)
– X … set of customers
§ We already know how to do this!
– Near-neighbor search in high dimensions (LSH)
– Clustering
– Dimensionality reduction
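The expensive step can be sketched as one linear scan over all other customers. The rating vectors and the choice of cosine similarity are illustrative assumptions; this is the naive approach that LSH, clustering or dimensionality reduction would replace.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0

def k_most_similar(target, users, k):
    """Scan every other customer once: O(|X|) similarity evaluations per query."""
    others = (u for u in users if u != target)
    return heapq.nlargest(k, others, key=lambda u: cosine(users[target], users[u]))

# Hypothetical rating vectors, one per customer.
users = {
    "x": [1, 0, 1],
    "a": [1, 0, 1],
    "b": [0, 1, 0],
    "c": [1, 1, 1],
}
neighbors = k_most_similar("x", users, k=2)
```

Running this scan for every customer in advance is what makes naive pre-computation costly as |X| grows.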
Tip: Add Data
§ Leverage all the data
– Don’t try to reduce data size in an effort to make fancy algorithms work
– Simple methods on large data do best
§ Add more data
– e.g., add IMDB data on genres
§ More data beats better algorithms
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html