CS 5614: (Big) Data Management Systems, B. Aditya Prakash, Lecture #16: Recommendation Systems

Post on 29-Mar-2020


Page 1: CS 5614: (Big) Data Management Systems · people.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/lecture-16.pdf

CS 5614: (Big) Data Management Systems

B. Aditya Prakash
Lecture #16: Recommendation Systems

Example: Recommender Systems

§ Customer X
– Buys Metallica CD
– Buys Megadeth CD

§ Customer Y
– Does search on Metallica
– Recommender system suggests Megadeth from data collected about customer X

Recommendations

[Figure: items reach the user via two routes, Search and Recommendations]

Examples: Products, web sites, blogs, news items, …

From Scarcity to Abundance

§ Shelf space is a scarce commodity for traditional retailers
– Also: TV networks, movie theaters, …

§ Web enables near-zero-cost dissemination of information about products
– From scarcity to abundance

§ More choice necessitates better filters
– Recommendation engines
– How Into Thin Air made Touching the Void a bestseller: http://www.wired.com/wired/archive/12.10/tail.html

Sidenote: The Long Tail

[Figure: the long tail. Source: Chris Anderson (2004)]

Physical vs. Online

[Figure: physical vs. online retail]

Read http://www.wired.com/wired/archive/12.10/tail.html to learn more!

Types of Recommendations

§ Editorial and hand curated
– List of favorites
– Lists of "essential" items

§ Simple aggregates
– Top 10, Most Popular, Recent Uploads

§ Tailored to individual users
– Amazon, Netflix, …

Formal Model

§ X = set of Customers
§ S = set of Items

§ Utility function u: X × S → R
– R = set of ratings
– R is a totally ordered set
– e.g., 0-5 stars, real number in [0,1]

Utility Matrix

[Table: sparse utility matrix; rows are users (Alice, Bob, Carol, David), columns are movies (Avatar, LOTR, Matrix, Pirates). Extracted entries: 0.4, 1, 0.2, 0.3, 0.5, 0.2, 1]
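One common way to hold a sparse utility matrix in code is a mapping from (user, item) pairs to ratings, so unknown cells cost nothing. The sketch below assumes that representation. The cell assignments are hypothetical illustration values, since the transcript does not preserve which cell each rating belongs to, and the helper names `known_rating` and `sparsity` are made up for this example.

```python
# Sparse utility matrix u: X x S -> R, stored as {(user, item): rating}.
# The cell assignments below are hypothetical illustration values.
utility = {
    ("Alice", "Avatar"): 1.0,
    ("Alice", "Matrix"): 0.2,
    ("Bob", "LOTR"): 0.5,
    ("David", "Pirates"): 0.4,
}

users = ["Alice", "Bob", "Carol", "David"]
items = ["Avatar", "LOTR", "Matrix", "Pirates"]


def known_rating(user, item):
    """Return the known rating, or None if the pair is unrated."""
    return utility.get((user, item))


def sparsity():
    """Fraction of (user, item) cells with no known rating."""
    return 1.0 - len(utility) / (len(users) * len(items))
```

With 4 known ratings out of 16 cells, `sparsity()` is 0.75; real rating matrices are typically far sparser.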

Key Problems

§ (1) Gathering "known" ratings for matrix
– How to collect the data in the utility matrix

§ (2) Extrapolate unknown ratings from the known ones
– Mainly interested in high unknown ratings
• We are not interested in knowing what you don't like but what you like

§ (3) Evaluating extrapolation methods
– How to measure success/performance of recommendation methods

(1) Gathering Ratings

§ Explicit
– Ask people to rate items
– Doesn't work well in practice: people can't be bothered

§ Implicit
– Learn ratings from user actions
• E.g., purchase implies high rating
– What about low ratings?

(2) Extrapolating Utilities

§ Key problem: Utility matrix U is sparse
– Most people have not rated most items
– Cold start:
• New items have no ratings
• New users have no history

§ Three approaches to recommender systems:
– 1) Content-based
– 2) Collaborative
– 3) Latent factor based (only briefly)

CONTENT-BASED RECOMMENDER SYSTEMS

Content-based Recommendations

§ Main idea: Recommend items to customer x similar to previous items rated highly by x

Example:
§ Movie recommendations
– Recommend movies with same actor(s), director, genre, …

§ Websites, blogs, news
– Recommend other sites with "similar" content

Plan of Action

[Figure: from the items a user likes (e.g., red circles, triangles), build item profiles and then a user profile; match the user profile against item profiles to recommend new items]

Item Profiles

§ For each item, create an item profile

§ Profile is a set (vector) of features
– Movies: author, title, actor, director, …
– Text: Set of "important" words in document

§ How to pick important features?
– Usual heuristic from text mining is TF-IDF (Term frequency * Inverse Doc Frequency)
• Term … Feature
• Document … Item

Sidenote: TF-IDF

$f_{ij}$ = frequency of term (feature) i in doc (item) j
$TF_{ij} = f_{ij} / \max_k f_{kj}$
Note: we normalize TF to discount for "longer" documents

$n_i$ = number of docs that mention term i
N = total number of docs
$IDF_i = \log(N / n_i)$

TF-IDF score: $w_{ij} = TF_{ij} \times IDF_i$

Doc profile = set of words with highest TF-IDF scores, together with their scores
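The TF-IDF scoring can be sketched in a few lines. This is a minimal illustration, assuming the common normalization TF_ij = f_ij / max_k f_kj and IDF_i = log2(N / n_i); the base of the logarithm, the toy documents, and the function name `tf_idf_profiles` are assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf_profiles(docs, top_k=2):
    """docs: dict name -> list of terms. Returns name -> [(term, score), ...]."""
    n_docs = len(docs)  # N
    # n_i: number of docs that mention term i
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    profiles = {}
    for name, terms in docs.items():
        counts = Counter(terms)            # f_ij
        max_count = max(counts.values())   # normalizer that discounts long docs
        scores = {
            term: (f / max_count) * math.log2(n_docs / doc_freq[term])
            for term, f in counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        profiles[name] = ranked[:top_k]
    return profiles


# Hypothetical toy corpus
docs = {
    "d1": ["space", "alien", "space", "war"],
    "d2": ["love", "war", "love"],
    "d3": ["space", "love", "robot"],
}
profiles = tf_idf_profiles(docs)
```

In this toy corpus "alien" outranks "space" in d1 even though "space" occurs more often there, because "alien" appears in only one document.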

User Profiles and Prediction

§ User profile possibilities:
– Weighted average of rated item profiles
– Variation: weight by difference from average rating for item
– …

§ Prediction heuristic:
– Given user profile x and item profile i, estimate $u(\mathbf{x}, \mathbf{i}) = \cos(\mathbf{x}, \mathbf{i}) = \frac{\mathbf{x} \cdot \mathbf{i}}{\|\mathbf{x}\| \cdot \|\mathbf{i}\|}$
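The two steps above (build a user profile, then score items by cosine) can be sketched as follows. This assumes the simplest variant, a rating-weighted average of item profiles; the feature order, item names, and ratings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


def user_profile(item_profiles, ratings):
    """Rating-weighted average of the profiles of the items the user rated."""
    total = sum(ratings.values())
    dims = len(next(iter(item_profiles.values())))
    return [
        sum(r * item_profiles[i][d] for i, r in ratings.items()) / total
        for d in range(dims)
    ]


# Hypothetical feature order: [action, comedy, sci-fi]
item_profiles = {
    "m1": [1, 0, 1],
    "m2": [0, 1, 0],
    "m3": [1, 0, 0],
    "m4": [0, 1, 1],
}
ratings = {"m1": 5, "m2": 1}  # user x's known ratings
x = user_profile(item_profiles, ratings)

# Score only the items the user has not rated yet
scores = {i: cosine(x, p) for i, p in item_profiles.items() if i not in ratings}
best = max(scores, key=scores.get)
```

Here the user rated the action/sci-fi movie highly, so the unrated action movie m3 scores higher than the comedy/sci-fi movie m4.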

Pros: Content-based Approach

§ +: No need for data on other users
– No cold-start or sparsity problems

§ +: Able to recommend to users with unique tastes

§ +: Able to recommend new & unpopular items
– No first-rater problem

§ +: Able to provide explanations
– Can provide explanations of recommended items by listing content-features that caused an item to be recommended

Cons: Content-based Approach

§ –: Finding the appropriate features is hard
– E.g., images, movies, music

§ –: Recommendations for new users
– How to build a user profile?

§ –: Overspecialization
– Never recommends items outside user's content profile
– People might have multiple interests
– Unable to exploit quality judgments of other users

COLLABORATIVE FILTERING

Harnessing quality judgments of other users

Collaborative Filtering

§ Consider user x

§ Find set N of other users whose ratings are "similar" to x's ratings

§ Estimate x's ratings based on ratings of users in N

[Figure: user x and the set N of similar users]

Finding "Similar" Users

§ Let $r_x$ be the vector of user x's ratings
– Example: $r_x$ = [*, _, _, *, ***], $r_y$ = [*, _, **, **, _]

§ Jaccard similarity measure
– $r_x$, $r_y$ as sets: $r_x$ = {1, 4, 5}, $r_y$ = {1, 3, 4}
– Problem: Ignores the value of the rating

§ Cosine similarity measure
– sim(x, y) = cos($r_x$, $r_y$)
– $r_x$, $r_y$ as points: $r_x$ = {1, 0, 0, 1, 3}, $r_y$ = {1, 0, 2, 2, 0}
– Problem: Treats missing ratings as "negative"

§ Pearson correlation coefficient
– $S_{xy}$ = items rated by both users x and y
– $\bar{r}_x$, $\bar{r}_y$ … avg. rating of x, y

$$sim(x, y) = \frac{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)(r_{ys} - \bar{r}_y)}{\sqrt{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)^2}\,\sqrt{\sum_{s \in S_{xy}} (r_{ys} - \bar{r}_y)^2}}$$
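The three measures can be compared directly on the slide's example vectors. This is a minimal sketch that encodes "not rated" as 0. One assumption: the Pearson means are each user's average over all items that user rated, while the sums run only over the co-rated items S_xy.

```python
import math

# Ratings from the slide; 0 marks "not rated" (items are numbered 1..5)
rx = [1, 0, 0, 1, 3]
ry = [1, 0, 2, 2, 0]


def jaccard(a, b):
    """Overlap of the sets of rated items; ignores the rating values."""
    sa = {i for i, v in enumerate(a) if v}
    sb = {i for i, v in enumerate(b) if v}
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine of the raw rating vectors; missing ratings count as 0."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


def pearson(a, b):
    """Pearson correlation over the items rated by both users."""
    # Assumption: each mean is taken over that user's own rated items.
    mean_a = sum(a) / sum(1 for v in a if v)
    mean_b = sum(b) / sum(1 for v in b if v)
    common = [i for i in range(len(a)) if a[i] and b[i]]  # S_xy
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den_a = math.sqrt(sum((a[i] - mean_a) ** 2 for i in common))
    den_b = math.sqrt(sum((b[i] - mean_b) ** 2 for i in common))
    return num / (den_a * den_b)


sim_jaccard = jaccard(rx, ry)
sim_cosine = cosine(rx, ry)
sim_pearson = pearson(rx, ry)
```

For these vectors the Jaccard similarity is 2/4, since items 1 and 4 are shared out of items {1, 3, 4, 5}.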

Similarity Metric

§ Intuitively we want: sim(A, B) > sim(A, C)
§ Jaccard similarity: 1/5 < 2/4
§ Cosine similarity: 0.386 > 0.322
– Considers missing ratings as "negative"
– Solution: subtract the (row) mean
• sim A,B vs. A,C: 0.092 > -0.559
• Notice cosine sim. is correlation when data is centered at 0
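The comparison above can be reproduced in code once a ratings table for A, B, and C is fixed. The transcript does not include that table, so the ratings below are an assumption: a small example in the style of the standard textbook treatment, chosen because it reproduces the Jaccard values 1/5 and 2/4, the cosine value 0.322, and the centered values 0.092 and -0.559. Under this assumed table the plain cosine for A,B comes out near 0.380 rather than the 0.386 printed on the slide.

```python
import math

# Assumed ratings for users A, B, C over seven items (0 = not rated).
A = [4, 0, 0, 5, 1, 0, 0]
B = [5, 5, 4, 0, 0, 0, 0]
C = [0, 0, 0, 2, 4, 5, 0]


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


def center(r):
    """Subtract the user's mean rating from rated entries; unrated stay 0."""
    rated = [v for v in r if v]
    mean = sum(rated) / len(rated)
    return [v - mean if v else 0.0 for v in r]


plain_ab, plain_ac = cosine(A, B), cosine(A, C)
centered_ab = cosine(center(A), center(B))
centered_ac = cosine(center(A), center(C))
```

After centering, a missing rating sits at the user's average instead of acting like a strong dislike, which is why the gap between sim(A, B) and sim(A, C) widens.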

Rating Predictions

Here N is the set of users similar to x who have rated item i, k = |N|, and $s_{xy} = sim(x, y)$.

§ Simple average:
$$r_{xi} = \frac{1}{k} \sum_{y \in N} r_{yi}$$

§ Similarity-weighted average:
$$r_{xi} = \frac{\sum_{y \in N} s_{xy} \cdot r_{yi}}{\sum_{y \in N} s_{xy}}$$
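Both prediction rules are short enough to write out directly. The sketch below assumes the neighbor set N has already been chosen; the neighbor names, ratings, and similarity values are hypothetical.

```python
def predict_mean(neighbor_ratings):
    """r_xi = (1/k) * sum of r_yi over the k neighbors y in N."""
    return sum(neighbor_ratings.values()) / len(neighbor_ratings)


def predict_weighted(neighbor_ratings, sims):
    """r_xi = sum(s_xy * r_yi) / sum(s_xy) over neighbors y in N."""
    num = sum(sims[y] * r for y, r in neighbor_ratings.items())
    den = sum(sims[y] for y in neighbor_ratings)
    return num / den


# Hypothetical neighbors of user x who rated item i, with their similarities
neighbor_ratings = {"y1": 4.0, "y2": 2.0, "y3": 5.0}
sims = {"y1": 0.9, "y2": 0.1, "y3": 0.5}

simple = predict_mean(neighbor_ratings)
weighted = predict_weighted(neighbor_ratings, sims)
```

The weighted estimate (4.2) is higher than the simple mean (about 3.67) because the low rating comes from the least similar neighbor.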

Item-Item Collaborative Filtering

§ So far: User-user collaborative filtering
§ Another view: Item-item
– For item i, find other similar items
– Estimate rating for item i based on ratings for similar items
– Can use same similarity metrics and prediction functions as in user-user model

$$r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$$

$s_{ij}$ … similarity of items i and j
$r_{xj}$ … rating of user x on item j
N(i;x) … set of items rated by x that are similar to i

Item-Item CF (|N|=2)

Utility matrix (rows: movies, columns: users; "." = unknown rating, numbers = rating between 1 and 5):

users      1  2  3  4  5  6  7  8  9 10 11 12
movie 1    1  .  3  .  ?  5  .  .  5  .  4  .
movie 2    .  .  5  4  .  .  4  .  .  2  1  3
movie 3    2  4  .  1  2  .  3  .  4  3  5  .
movie 4    .  2  4  .  5  .  .  4  .  .  2  .
movie 5    .  .  4  3  4  2  .  .  .  .  2  5
movie 6    1  .  3  .  3  .  .  2  .  .  4  .

§ Goal: estimate rating of movie 1 by user 5 (marked "?")

§ Neighbor selection: Identify movies similar to movie 1, rated by user 5
– Here we use Pearson correlation as similarity:
• 1) Subtract mean rating $m_i$ from each movie i
  $m_1$ = (1+3+5+5+4)/5 = 3.6
  row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
• 2) Compute cosine similarities between rows
– sim(1, m) for m = 1, …, 6: 1.00, -0.18, 0.41, -0.10, -0.31, 0.59

§ Compute similarity weights: $s_{1,3}$ = 0.41, $s_{1,6}$ = 0.59

§ Predict by taking weighted average:
  $r_{1,5}$ = (0.41*2 + 0.59*3) / (0.41 + 0.59) = 2.6

$$r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$$
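The worked example can be checked end to end in code. The rating values come from the slides, but the extraction dropped the empty cells, so the column positions in `R` below are an assumption, chosen because they reproduce the slide's mean-centered row 1 and all of its sim(1, m) values. The function names are made up for this sketch.

```python
import math

# Rows are movies 1..6, columns are users 1..12, 0 = unknown rating.
# Cell positions are an assumption (see the lead-in above).
R = [
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
]


def center(row):
    """Subtract the movie's mean rating from rated entries; unrated stay 0."""
    rated = [v for v in row if v]
    mean = sum(rated) / len(rated)
    return [v - mean if v else 0.0 for v in row]


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


def predict(R, movie, user, k=2):
    """Estimate R[movie][user] from the k most similar movies the user rated."""
    target = center(R[movie])
    sims = {
        j: cosine(target, center(R[j]))
        for j in range(len(R))
        if j != movie and R[j][user]
    }
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]
    num = sum(sims[j] * R[j][user] for j in neighbors)
    den = sum(sims[j] for j in neighbors)
    # Report neighbor weights with 1-based movie numbers, as on the slides
    return num / den, {j + 1: sims[j] for j in neighbors}


# Movie 1, user 5 (0-based indices 0 and 4)
estimate, weights = predict(R, movie=0, user=4)
```

The selected neighbors are movies 3 and 6, and the unrounded estimate is about 2.59, which matches the slide's 2.6 after rounding.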

CF: Common Practice

§ Define similarity $s_{ij}$ of items i and j
§ Select k nearest neighbors N(i; x)
– Items most similar to i, that were rated by x

§ Estimate rating $r_{xi}$ as the weighted average:

Before:
$$r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$$

With baseline estimates:
$$r_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij} \cdot (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}}$$

Baseline estimate for $r_{xi}$:
$$b_{xi} = \mu + b_x + b_i$$
– μ = overall mean movie rating
– $b_x$ = rating deviation of user x = (avg. rating of user x) – μ
– $b_i$ = rating deviation of movie i
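A small numeric sketch of the baseline-adjusted estimate follows. It assumes the movie deviation b_i is defined like the user deviation, as (avg. rating of movie i) minus μ, which the slide does not spell out. All numbers are hypothetical.

```python
def baseline(mu, user_mean, item_mean):
    """b_xi = mu + b_x + b_i, with deviations measured from the global mean mu."""
    b_x = user_mean - mu
    b_i = item_mean - mu  # assumed to be defined analogously to b_x
    return mu + b_x + b_i


def predict_with_baseline(b_xi, neighbors):
    """neighbors: list of (s_ij, r_xj, b_xj) for the items j in N(i; x)."""
    num = sum(s * (r - b) for s, r, b in neighbors)
    den = sum(s for s, _, _ in neighbors)
    return b_xi + num / den


# Hypothetical numbers: global mean 3.7, a slightly harsh user, a well-liked movie
mu = 3.7
b_xi = baseline(mu, user_mean=3.5, item_mean=4.2)

# Two neighbor items: (similarity, user's rating, baseline for that pair)
neighbors = [(0.6, 4.5, 3.5), (0.4, 3.0, 3.5)]
estimate = predict_with_baseline(b_xi, neighbors)
```

The neighbors contribute only their deviation from what the baseline already expected, so the estimate moves from the baseline 4.0 up to 4.4.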

Item-Item vs. User-User

[Table: utility matrix; rows are users (Alice, Bob, Carol, David), columns are movies (Avatar, LOTR, Matrix, Pirates)]

§ In practice, it has been observed that item-item often works better than user-user

§ Why? Items are simpler, users have multiple tastes

Pros/Cons of Collaborative Filtering

§ + Works for any kind of item
– No feature selection needed

§ - Cold Start:
– Need enough users in the system to find a match

§ - Sparsity:
– The user/ratings matrix is sparse
– Hard to find users that have rated the same items

§ - First rater:
– Cannot recommend an item that has not been previously rated
– New items, Esoteric items

§ - Popularity bias:
– Cannot recommend items to someone with unique taste
– Tends to recommend popular items

Hybrid Methods

§ Implement two or more different recommenders and combine predictions
– Perhaps using a linear model

§ Add content-based methods to collaborative filtering
– Item profiles for new item problem
– Demographics to deal with new user problem
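One way to read the two bullets together is a linear blend that falls back to the content-based score when collaborative filtering has nothing to go on. The sketch below is an illustration under that reading; the function name `hybrid_predict` and the weight 0.75 are hypothetical, and in practice the weight would be fit on held-out ratings.

```python
def hybrid_predict(cf_prediction, content_prediction, weight_cf=0.75):
    """Linear blend of two recommenders, with a content-based fallback."""
    if cf_prediction is None:
        # New item or new user: collaborative filtering has no ratings to use
        return content_prediction
    return weight_cf * cf_prediction + (1 - weight_cf) * content_prediction


blended = hybrid_predict(4.0, 3.0)        # both recommenders available
cold_start = hybrid_predict(None, 3.0)    # collaborative estimate missing
```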

LATENT FACTOR MODELS

Latent Factor Models (e.g., SVD)

[Figure: movies placed in a two-dimensional latent space. One axis runs from "geared towards females" to "geared towards males", the other from "serious" to "funny". Movies shown: The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility]

Latent Factor Models

§ "SVD" on Netflix data: R ≈ Q · P^T
– SVD: A = U Σ V^T

§ For now let's assume we can approximate the rating matrix R as a product of "thin" Q · P^T
– R has missing entries but let's ignore that for now!
• Basically, we will want the reconstruction error to be small on known ratings and we don't care about the values on the missing ones

[Figure: R (items × users) ≈ Q (items × factors) · P^T (factors × users); R is the 6 × 12 ratings matrix from the item-item CF example]

Q (items × factors):

  .1  -.4   .2
 -.5   .6   .5
 -.2   .3   .5
 1.1  2.1   .3
 -.7  2.1  -2
 -1    .7   .3

P^T (factors × users):

 1.1  -.2   .3   .5  -2   -.5   .8  -.4   .3  1.4  2.4  -.9
 -.8   .7   .5  1.4   .3  -1   1.4  2.9  -.7  1.2  -.1  1.3
 2.1  -.4   .6  1.7  2.4   .9  -.3   .4   .8   .7  -.6   .1

Ratings as Products of Factors

§ How to estimate the missing rating of user x for item i?
– $q_i$ = row i of Q
– $p_x$ = column x of P^T
– Estimate: $\hat{r}_{xi} = q_i \cdot p_x = \sum_f q_{if} \cdot p_{xf}$, summing over the f factors

[Figure: the missing entry "?" of R (items × users) is computed from the matching row of Q (items × f factors) and column of P^T (f factors × users), giving 2.4]
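The estimate is a single dot product between an item's factor row and a user's factor column. In the sketch below the entries of `Q` and `PT` are the numbers from the slides, read in an assumed left-to-right order, and the position of the "?" (item 2, user 5) is also an assumption, chosen because it reproduces the slide's 2.4.

```python
# Q: items x factors, PT: factors x users (entry order is an assumption)
Q = [
    [0.1, -0.4, 0.2],
    [-0.5, 0.6, 0.5],
    [-0.2, 0.3, 0.5],
    [1.1, 2.1, 0.3],
    [-0.7, 2.1, -2.0],
    [-1.0, 0.7, 0.3],
]
PT = [
    [1.1, -0.2, 0.3, 0.5, -2.0, -0.5, 0.8, -0.4, 0.3, 1.4, 2.4, -0.9],
    [-0.8, 0.7, 0.5, 1.4, 0.3, -1.0, 1.4, 2.9, -0.7, 1.2, -0.1, 1.3],
    [2.1, -0.4, 0.6, 1.7, 2.4, 0.9, -0.3, 0.4, 0.8, 0.7, -0.6, 0.1],
]


def predict(item, user):
    """Estimated rating: dot product of row `item` of Q and column `user` of PT."""
    return sum(Q[item][f] * PT[f][user] for f in range(len(PT)))


# Item 2, user 5 in 1-based numbering (assumed position of the "?")
estimate = predict(item=1, user=4)
```

The three products are 1.0, 0.18, and 1.2, so the estimate is 2.38, which rounds to the slide's 2.4.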

Latent Factor Models

[Figure: the movies from the earlier latent-space plot (The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility), with the axes now labeled Factor 1 and Factor 2]

For more details

§ Read the textbook!

REMARKS & PRACTICAL TIPS

- Evaluation
- Error metrics
- Complexity / Speed

Evaluation

[Figure: users × movies matrix of known ratings; a block of the known ratings is withheld as the Test Data Set (shown as "?") and must be predicted]

Evaluating Predictions

§ Compare predictions with known ratings
– Root-mean-square error (RMSE): $\sqrt{\frac{1}{n} \sum_{xi} (r_{xi} - r^*_{xi})^2}$
• where $r_{xi}$ is the predicted and $r^*_{xi}$ the true rating of x on i, and n is the number of test ratings
– Precision at top 10:
• % of those in top 10
– Rank Correlation:
• Spearman's correlation between system's and user's complete rankings

§ Another approach: 0/1 model
– Coverage:
• Number of items/users for which system can make predictions
– Precision:
• Accuracy of predictions
– Receiver operating characteristic (ROC)
• Tradeoff curve between false positives and false negatives
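RMSE on a held-out test set takes only a few lines. This minimal sketch averages the squared errors over the test ratings before taking the root; the users, items, and rating values are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error over the (user, item) pairs in `actual`."""
    errors = [(predicted[key] - actual[key]) ** 2 for key in actual]
    return math.sqrt(sum(errors) / len(errors))


# Hypothetical held-out test ratings and a system's predictions for them
actual = {("x1", "i1"): 4.0, ("x1", "i2"): 2.0, ("x2", "i1"): 5.0}
predicted = {("x1", "i1"): 3.0, ("x1", "i2"): 2.0, ("x2", "i1"): 3.0}
error = rmse(predicted, actual)
```

The squared errors are 1, 0, and 4, so the single miss by 2 dominates the result: large errors are penalized more heavily than small ones.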

Problems with Error Measures

§ Narrow focus on accuracy sometimes misses the point
– Prediction Diversity
– Prediction Context
– Order of predictions

§ In practice, we care only about predicting high ratings:
– RMSE might penalize a method that does well for high ratings and badly for others

Collaborative Filtering: Complexity

§ Expensive step is finding k most similar customers: O(|X|)

§ Too expensive to do at runtime
– Could pre-compute

§ Naïve pre-computation takes time O(k·|X|)
– X … set of customers

§ We already know how to do this!
– Near-neighbor search in high dimensions (LSH)
– Clustering
– Dimensionality reduction

Tip: Add Data

§ Leverage all the data
– Don't try to reduce data size in an effort to make fancy algorithms work
– Simple methods on large data do best

§ Add more data
– e.g., add IMDB data on genres

§ More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html