pagerank - information retrievalir.cis.udel.edu/~carteret/cisc689/slides/lecture20.pdf ·...
5/4/09
PageRank
CISC489/689-010, Lecture #20
Wednesday, April 29th
Ben Carterette
Web Search
• Problem:
  – Web search engines are easily hacked
  – I want to sell something; I’ll just add a few popular keywords to my page over and over and over again
  – All the retrieval models we’ve discussed will score that page higher for those keywords
• Other problems:
  – No hackers, but top-ranked pages are coming from deep within a site, or from pages that change often, or pages about very obscure topics
  – Not really useful
Possible Solution
• Leverage link structure
• Maybe if many pages are linking to a page, that page is more “important”
• Idea:
  – Count the number of inlinks to the page
  – Assign it “importance” based on that number
• Any problem with this?
Hacking Link Counts
• I can just make a bunch of pages that link to my spam page
• Inlink count will be high even though my page is not important
• Better idea:
  – Recursively use the importance of the linking pages when calculating the importance of the page
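To make the weakness of raw inlink counts concrete, here is a minimal sketch. The toy web, the page names, and the helper `inlink_counts` are all hypothetical and only meant for illustration.

```python
from collections import Counter

def inlink_counts(links):
    """Count inlinks per page; `links` maps each page to the pages it links to."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):  # count each linking page at most once
            counts[target] += 1
    return counts

# Hypothetical toy web: two pages link to a genuinely useful page.
web = {"a": ["useful"], "b": ["useful"], "useful": []}
# A spammer adds five throwaway pages that all link to the spam page.
web.update({f"farm{i}": ["spam"] for i in range(5)})

counts = inlink_counts(web)
print(counts["useful"], counts["spam"])
```

Under pure inlink counting the spam page outranks the useful one, which is the motivation for weighting each link by the importance of the page it comes from.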
PageRank
• Google’s PageRank is probably the best-known algorithm
• Intuitive idea: “random surfer” model
  – If you start on a random page on the internet and just start clicking links randomly,
  – What is the probability you will land on page u?
  – If one page has a higher landing probability, the pages it links to have higher landing probabilities as well
  – Higher probability = more authority = better PageRank
Illustration
PageRank Definition
$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
R(u) is the PageRank of u
B_u is the set of pages that link to u
v is one of the pages in B_u
R(v) is the PageRank of v
N_v is the number of pages that v links to
c is a normalization constant
Sinks
• Problem:
  – I have two pages that only link to each other, plus one page that links to one of them
  – When the “random surfer” gets to one of those pages, he will just keep alternating between them
  – Their PageRank will dominate everything else
• Solution: once in a while the random surfer just starts over at a new page
  – OK, but how do I put that in PageRank?
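To see the sink problem concretely, the sketch below applies the basic update R(u) = Σ R(v)/N_v repeatedly to the three-page example from this slide. It assumes c = 1 and a uniform starting rank; the page names and the helper `simple_step` are hypothetical.

```python
def simple_step(rank, links):
    """One update R(u) = sum over v linking to u of R(v) / N_v (no restarts)."""
    new_rank = {page: 0.0 for page in rank}
    for v, outlinks in links.items():
        for u in outlinks:
            new_rank[u] += rank[v] / len(outlinks)
    return new_rank

# "b" and "c" only link to each other; "a" links to "b".
links = {"a": ["b"], "b": ["c"], "c": ["b"]}
rank = {page: 1 / 3 for page in links}
for _ in range(10):
    rank = simple_step(rank, links)
print(rank)
```

After the first step page "a" has no rank left, and all of the rank keeps bouncing between "b" and "c" without settling.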
Random Restarts
$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$$
E(u) is the probability that a random surfer jumps to u
Calculating PageRank
• A simple iterative algorithm:
  – First, assign a PageRank to every page
    • E.g. R_0(u) = 1/N
  – Initialize E(u) = α for all u (0.15/N in original paper)
  – Over iterations i = 1…, do
    • Update each PageRank as: $R_{i+1}(u) = \sum_{v \in B_u} \frac{R_i(v)}{N_v}$
    • Calculate d as the sum of PageRanks from the previous iteration minus the sum of PageRanks from the current iteration
    • Update each PageRank as: $R_{i+1}(u) = R_{i+1}(u) + d\,E(u)$
    • Calculate δ as the sum of |PageRanks from the current iteration minus PageRanks from the previous iteration|
    • If δ < ε, PageRanks have converged
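The iterative steps on this slide can be sketched roughly as follows. This is a literal reading of the slide, not a tuned implementation: the function name, parameter names, defaults, and the toy graph are assumptions. Note that d is nonzero only when rank leaks out of the graph (pages with no outlinks); the example graph has no such pages, so d stays at essentially zero.

```python
def iterative_pagerank(links, alpha=0.15, eps=1e-10, max_iter=1000):
    """Follow the slide: propagate rank, add d * E(u), stop when delta < eps."""
    pages = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(pages)
    rank = {u: 1.0 / n for u in pages}   # R_0(u) = 1/N
    e = {u: alpha / n for u in pages}    # E(u), uniform (assumption: alpha/N)
    for _ in range(max_iter):
        new_rank = {u: 0.0 for u in pages}
        for v in pages:
            outlinks = links.get(v, [])
            for u in outlinks:
                new_rank[u] += rank[v] / len(outlinks)
        # d: rank lost in this step (previous total minus current total).
        d = sum(rank.values()) - sum(new_rank.values())
        for u in pages:
            new_rank[u] += d * e[u]
        delta = sum(abs(new_rank[u] - rank[u]) for u in pages)
        rank = new_rank
        if delta < eps:                  # converged
            break
    return rank

# Hypothetical three-page graph with no dangling pages.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = iterative_pagerank(graph)
print(ranks)
```

For this graph the ranks settle near A = 0.4, B = 0.2, C = 0.4, which is the stationary distribution of the random walk over these links.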
Scalability
• Calculating PageRank requires a vector for every URL
• Along with a list of the URLs that link to that URL and a list of URLs that page links to
• Space usage is O(N^2)
• Time complexity also O(N^2)
• Even for small collections (like Wikipedia), it is nearly impossible to keep all of this in memory
Calculating PageRank with MapReduce
• First: extract links
  – For every page, I need to know all the pages that link to it
  – MapReduce solution:
    • Map operator takes page u and outputs (v, u) for every URL v in page u
    • Reduce operator takes all tuples with key v and reduces them to a list of unique URLs B_v that link to v
      – (v, (u1, u2, u3, …))
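The link-extraction job can be imitated in a single process to show the data flow. This is only a sketch: `map_links`, `reduce_links`, and `extract_inlinks` are hypothetical names, and a dictionary stands in for the shuffle phase of a real MapReduce framework.

```python
from collections import defaultdict

def map_links(page, outlinks):
    """Map: for page u, emit (v, u) for every URL v found in u."""
    for v in outlinks:
        yield (v, page)

def reduce_links(target, sources):
    """Reduce: collapse all tuples keyed by v into (v, unique inlinking pages)."""
    return (target, sorted(set(sources)))

def extract_inlinks(pages):
    grouped = defaultdict(list)  # stands in for the shuffle/group-by-key phase
    for page, outlinks in pages.items():
        for key, value in map_links(page, outlinks):
            grouped[key].append(value)
    return dict(reduce_links(key, values) for key, values in grouped.items())

# Hypothetical input: page -> URLs found on that page (u3 links to v twice).
pages = {"u1": ["v", "w"], "u2": ["v"], "u3": ["v", "v"]}
inlinks = extract_inlinks(pages)
print(inlinks)
```

The duplicate link from u3 to v is collapsed by the reduce step, matching the "unique URLs" requirement.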
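Calculating PageRank with MapReduce
• Next: calculate PageRank iteratively
  – First, initialize all PageRanks to 1/N
  – Then iteratively MapReduce
    • Map operator takes a page u_i and the URLs v_j (j = 1..n) that it links to, and outputs (v_j, R(u_i)/n)
    • Reduce operator takes pairs (v_j, R(u_i)/n) and outputs
      $$\left( v_j,\ (1 - d) + d \sum_{i=1}^{m} \frac{R(u_i)}{n} \right)$$
      where the sum runs over the m pages u_i that link to v_j, n is the number of outlinks of u_i, and d here is a damping factor
    • Calculate delta and determine whether converged
    • If not, MapReduce again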
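One round of this map and reduce pair, repeated until convergence, might look like the single-process sketch below. The function names, the default d = 0.85, and the toy graph are assumptions; the extra `(page, 0.0)` emission is also an assumption, added so that pages with no inlinks still reach the reducer. The sketch assumes every linked page appears as a key in `links`.

```python
from collections import defaultdict

def map_rank(page, rank, outlinks):
    yield (page, 0.0)  # assumption: keeps pages with no inlinks in the output
    for v in outlinks:
        yield (v, rank / len(outlinks))  # (v_j, R(u_i)/n)

def reduce_rank(page, contributions, d):
    return (page, (1 - d) + d * sum(contributions))

def pagerank_mapreduce(links, d=0.85, eps=1e-9, max_rounds=200):
    ranks = {u: 1.0 / len(links) for u in links}  # initialize to 1/N
    for _ in range(max_rounds):
        grouped = defaultdict(list)               # stands in for the shuffle
        for u, outlinks in links.items():
            for key, value in map_rank(u, ranks[u], outlinks):
                grouped[key].append(value)
        new_ranks = dict(reduce_rank(k, vs, d) for k, vs in grouped.items())
        delta = sum(abs(new_ranks[u] - ranks[u]) for u in links)
        ranks = new_ranks
        if delta < eps:                           # converged, stop iterating
            break
    return ranks

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
mr_ranks = pagerank_mapreduce(graph)
print(mr_ranks)
```

With the (1 - d) + d Σ form of the update, the converged scores sum to the number of pages rather than to 1 when no page lacks outlinks, so they are relative importance scores rather than probabilities.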
PageRank for IR
• PageRank is query-independent
  – The importance of a page is not related to any query
  – We cannot simply rank pages by PageRank
• PageRank can be used to re-rank results that have been retrieved for a query
• It can also be used as a feature in the ranking function
• Or as a weight on anchor text features
Example Use of PageRank
From “The PageRank Citation Ranking: Bringing Order to the Web”, Page et al.
Wikipedia PageRanks
Page Title        PageRank
United States     2.9 × 10^-3
France            1.3 × 10^-3
United Kingdom    1.2 × 10^-3
England           1.0 × 10^-3
Germany           1.0 × 10^-3
Canada            0.9 × 10^-3
2007              0.8 × 10^-3
World War II      0.8 × 10^-3
Australia         0.7 × 10^-3
2008              0.7 × 10^-3
Based on links between Wikipedia pages
A Bit of Theory
• Markov chain:
  – N states with transition probability matrix P
  – At any time we are in exactly one state
  – P_ij indicates probability of moving from state i to state j
  – For all i, $\sum_{j=1}^{N} P_{ij} = 1$
A Bit of Theory
• Ergodic Markov chains
  – Ergodic means there is a path between any two states
  – No matter what state you start in, the probability of being in any other state after T steps is greater than zero (after a burn-in time T_0)
  – Over many steps T, each state has some “visit rate”: starting from any state, we will visit each state according to its visit rate
  – This is the steady state for the chain
A Bit of Theory
• Let x_0 be a vector representing our current state
  – 1 in position i, 0s everywhere else
  $$x_0 = \begin{pmatrix} 0 & 0 & 0 & \cdots & 1 & \cdots & 0 & 0 \end{pmatrix}$$
• What is the probability of each possible state we can transition to from x_0?
  $$x_1 = x_0 P$$
• And the probability of each possible state from x_1 (two steps from x_0)?
  $$x_2 = x_1 P = (x_0 P) P = x_0 P^2$$
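As a small worked example of x_1 = x_0 P and x_2 = x_0 P^2, the sketch below multiplies a row vector by a transition matrix in plain Python. The 3-state matrix and the helper name `step` are made up for illustration.

```python
def step(x, P):
    """Return the row vector x P for a row-stochastic matrix P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical chain: state 0 moves to 1 or 2, state 1 to 2, state 2 to 0.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

x0 = [1.0, 0.0, 0.0]  # we are in state 0 with certainty
x1 = step(x0, P)      # distribution after one step: [0.0, 0.5, 0.5]
x2 = step(x1, P)      # distribution after two steps: [0.5, 0.0, 0.5]
print(x1, x2)
```

Each x_k is itself a probability distribution, so its entries keep summing to 1.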
A Bit of Theory
• After k steps, $x_k = x_0 P^k$
• As k goes to infinity, x_k converges to the steady state
• When x_k is the steady state, $x_k P = x_k$
• The steady state is an eigenvector of P
  – As it turns out, it is the principal eigenvector
  – Eigenvalue = 1
• If P is a matrix of links between documents, the principal eigenvector holds the PageRanks
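The convergence of x_k to the steady state can be checked numerically by repeating the multiplication until the vector stops changing. This is a sketch under the assumption that the chain is ergodic; the matrix, tolerance, and function names are hypothetical.

```python
def step(x, P):
    """Return the row vector x P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

def steady_state(P, tol=1e-12, max_steps=10_000):
    """Repeat x <- x P from a one-hot start until the change is below tol."""
    x = [1.0] + [0.0] * (len(P) - 1)
    for _ in range(max_steps):
        nxt = step(x, P)
        if sum(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x

# Hypothetical ergodic chain over three states.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

pi = steady_state(P)
print(pi)           # close to [0.4, 0.2, 0.4]
print(step(pi, P))  # multiplying by P again leaves it (almost) unchanged
```

The second print illustrates the eigenvector property x P = x with eigenvalue 1.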
PageRank Modifications
• The E(u) quantities solve the sink problem, but can also be used to adjust PageRanks
• Usually, E(u) assigned uniformly
  – Equal probability to jump to any page
• Instead, bias to certain pages
  – One possibility: assign one page E(u) = α, all other pages 0
  – For example, Yahoo homepage
  – Then when the “random surfer” restarts, she always restarts at the same place
  – Result: Yahoo gets highest PageRank, followed by pages Yahoo links to
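The effect of biasing E(u) can be sketched with the common damping-factor form of PageRank, R(u) = (1 - d) E(u) + d Σ R(v)/N_v, where E sums to 1. This form is an assumption made for the example and is not the slide's exact c-normalized update; the graph, page names, and d = 0.85 are likewise hypothetical.

```python
def pagerank_with_restarts(links, E, d=0.85, iterations=100):
    """R(u) = (1 - d) * E(u) + d * sum over v in B_u of R(v) / N_v."""
    rank = dict(E)  # start from the restart distribution
    for _ in range(iterations):
        new_rank = {u: (1 - d) * E[u] for u in links}
        for v, outlinks in links.items():
            for u in outlinks:
                new_rank[u] += d * rank[v] / len(outlinks)
        rank = new_rank
    return rank

links = {"home": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["home"]}

uniform = {u: 1 / len(links) for u in links}                # jump anywhere
biased = {u: (1.0 if u == "home" else 0.0) for u in links}  # always restart at "home"

uniform_rank = pagerank_with_restarts(links, uniform)
biased_rank = pagerank_with_restarts(links, biased)
print(max(uniform_rank, key=uniform_rank.get))  # "c"
print(max(biased_rank, key=biased_rank.get))    # "home"
```

In this toy graph page "c" narrowly leads under uniform restarts, while restarting only at "home" moves "home" to the top.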
Topic-Based PageRank
• A different kind of random surfer:
  – First picks a category randomly
  – Then jumps to a page randomly within that category
• Instead of calculating a single PageRank for each page, calculate M PageRanks, one for each category
  – Category PageRank = PageRank among other pages in the same category
Topic-Based PageRank for Personalization
• Each individual user is more interested in some categories than others
• Calculate the probability that a user is interested in a category based on the frequency they visit pages in that category
  – E.g. sports = 0.7, finance = 0.2, health = 0.1, all others = 0
• Then the personalized PageRank for a page u is the weighted sum of category PageRanks for that page
  $$R(u) = \sum_{i=1}^{M} p_i R_i(u)$$
  – p_i is the user’s category i probability
  – R_i(u) is the PageRank of u for category i
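As a quick arithmetic check of the weighted sum, the sketch below uses the interest probabilities from the slide together with made-up category PageRanks for a single page; the function name and the rank values are hypothetical.

```python
def personalized_pagerank(category_ranks, interests):
    """R(u) = sum over categories i of p_i * R_i(u)."""
    return sum(p * category_ranks.get(cat, 0.0) for cat, p in interests.items())

# User interest probabilities from the slide's example.
interests = {"sports": 0.7, "finance": 0.2, "health": 0.1}
# Hypothetical category PageRanks R_i(u) for one page u.
ranks_for_u = {"sports": 0.004, "finance": 0.001, "health": 0.010}

score = personalized_pagerank(ranks_for_u, interests)
print(score)  # 0.7*0.004 + 0.2*0.001 + 0.1*0.010, about 0.004
```

Because the category PageRanks do not depend on the user, only this cheap weighted sum has to be computed per user.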
Hyperlink-Induced Topic Search (HITS)
• Another link-graph algorithm
• Idea:
  – Some pages are authoritative: they are very informative about a topic
  – Other pages are hubs for a topic: they link to a lot of pages on the topic
• Example:
  – CiteSeer links to a lot of computer science research papers; it’s a computer science research hub
  – The papers it links to are computer science research authorities
• Find both hubs and authorities
Hubs and Authorities
• Hubs are pages that link to a lot of authorities
• Authorities are pages that are linked to by a lot of hubs
• Another recursive definition
HITS Algorithm
• First, get a root set of pages
  – Pages that match a query, for example
• From the root set, construct a base set
  – Pages that link to the root set and pages that the root set links to
From “Authoritative Sources in a Hyperlinked Environment”, J. Kleinberg
HITS Algorithm
• Initialize “hub score” h(u) = 1 and “authority score” a(u) = 1 for each page u in the base set
• Then iteratively update h(u) and a(u) for all u:
  $$h(u) = \sum_{v \in F_u} a(v)$$
  $$a(u) = \sum_{v \in B_u} h(v)$$
  – F_u = set of pages that u links to
  – B_u = set of pages that link to u
• After each iteration, divide h(u) and a(u) by some constant
• After only a few iterations, scores converge
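The update loop can be sketched as below. Several details are assumptions rather than part of the slide: authorities are updated first and hubs are then computed from the new authority scores, the normalizing constant is the sum of the scores, and the iteration count and toy graph are arbitrary.

```python
def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a link graph."""
    pages = set(links) | {v for outs in links.values() for v in outs}
    hub = {u: 1.0 for u in pages}
    auth = {u: 1.0 for u in pages}
    for _ in range(iterations):
        # a(u) = sum of h(v) over pages v that link to u
        new_auth = {u: 0.0 for u in pages}
        for v, outlinks in links.items():
            for u in outlinks:
                new_auth[u] += hub[v]
        # h(u) = sum of a(v) over pages v that u links to
        new_hub = {u: sum(new_auth[v] for v in links.get(u, [])) for u in pages}
        # Divide by a constant (here: the sum) so scores stay bounded.
        a_total = sum(new_auth.values()) or 1.0
        h_total = sum(new_hub.values()) or 1.0
        auth = {u: s / a_total for u, s in new_auth.items()}
        hub = {u: s / h_total for u, s in new_hub.items()}
    return hub, auth

# Hypothetical base set: h1 links to both authorities, h2 to only one.
base = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hits(base)
print(hub, auth)
```

In this example a1 ends up as the stronger authority because both hubs point to it, and h1 as the stronger hub because it points to both authorities.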
HITS Example Results
From http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/Images/cars1.png
Matrix Form
• P is the transition matrix
• PP' is a sort of “similarity” matrix in terms of links to other pages
  – Entry i,j is higher if pages i and j link to the same pages
• P'P is a sort of “similarity” matrix in terms of links from other pages
  – Entry i,j is higher if pages i and j are linked to from the same pages
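To illustrate the two “similarity” matrices, the sketch below uses a 0/1 link matrix in place of P, which is an assumption made so the entries come out as simple counts of shared links. The four-page graph and helper names are hypothetical.

```python
def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Rows are source pages, columns are target pages (pages 0..3).
P = [[0, 0, 1, 1],   # page 0 links to pages 2 and 3
     [0, 0, 1, 1],   # page 1 links to pages 2 and 3
     [0, 0, 0, 1],   # page 2 links to page 3
     [0, 0, 0, 0]]   # page 3 links to nothing

out_sim = matmul(P, transpose(P))  # PP': shared outlinks
in_sim = matmul(transpose(P), P)   # P'P: shared inlinks

print(out_sim[0][1])  # pages 0 and 1 link to the same two pages -> 2
print(in_sim[2][3])   # pages 2 and 3 are both linked from pages 0 and 1 -> 2
```

With a 0/1 matrix, entry i,j of PP' counts the pages that both i and j link to, and entry i,j of P'P counts the pages that link to both i and j.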
Matrix Form
• We can write the hub score as a matrix-vector product:
  – h = Pa (transition matrix times authority score)
• We can write authority score as
  – a = P'h (transpose of the transition matrix times hub score)
• Substituting, we get
  – h = PP'h
  – a = P'Pa
Matrix Eigenvectors
• If h = PP'h, then h is an eigenvector of the “outlink similarity matrix”
  – Hub scores are a basis vector of a space defined by outlinks
• If a = P'Pa, then a is an eigenvector of the “inlink similarity matrix”
  – Authority scores are a basis vector of a space defined by inlinks