Social and Technological Network Data Analytics
Lecture 5: Structure of the Web, Search and Power Laws
Prof Cecilia Mascolo
In This Lecture
• We describe power law networks and their properties and show examples of networks which are power law in nature, including the web.
• We present the preferential attachment model, which allows the generation of power law networks.
• We study prediction of power laws
• We introduce search and PageRank
Precursor of hypertexts
• Citation networks of books and articles.
• Difference: links point only backwards in time
Web is a Directed Graph
• Path: A path from A to B exists if there is a sequence of nodes beginning with A and ending with B such that each consecutive pair of nodes is connected by an edge pointing in the forward direction.
[Figure: example directed graph on nodes A, B, C, D and E]
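A minimal Python sketch of the path definition above, using breadth-first search over forward-pointing edges. The edge set is hypothetical (the edges of the slide's figure are not recoverable here), and `has_path` is an illustrative name rather than anything defined in the lecture.

```python
from collections import deque

def has_path(graph, source, target):
    """Return True if a directed path leads from source to target.

    graph maps each node to the list of nodes it links to.
    """
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        # Only follow edges in the forward direction.
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Hypothetical edges on the five node labels used in the figure.
example = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["B"], "E": ["A"]}
print(has_path(example, "A", "D"))  # True: A -> B -> C -> D
print(has_path(example, "D", "A"))  # False: no edge leads back to A
```

Note that direction matters: in this example D is reachable from A, but A is not reachable from D.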
Strongly Connected Component
• A strongly connected component (SCC) in a directed graph is a subset of nodes such that:
i) Every pair of nodes in the subset has a path in each direction between them
ii) The subset is not part of some larger subset with property i)
• A weakly connected component (WCC) is a connected component of the undirected graph derived from the directed graph.
– Two nodes can be in the same WCC even if there is no directed path between them.
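The two definitions can be sketched directly in Python. This is a simple illustration, not an efficient algorithm: the SCC of a node is taken as the set of nodes it can reach intersected with the set of nodes that can reach it, and the WCCs are found by ignoring edge direction. All function names and the small example graph are assumptions made for the illustration.

```python
def reachable(graph, start):
    """All nodes reachable from start by following edges forward."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reverse(graph):
    """Graph with every edge flipped; has an entry for every node."""
    rev = {node: [] for node in graph}
    for node, targets in graph.items():
        for target in targets:
            rev.setdefault(target, []).append(node)
    return rev

def strongly_connected_components(graph):
    rev = reverse(graph)
    components, assigned = [], set()
    for node in sorted(rev):
        if node not in assigned:
            # Nodes reachable from `node` that can also reach `node`.
            comp = reachable(graph, node) & reachable(rev, node)
            components.append(comp)
            assigned |= comp
    return components

def weakly_connected_components(graph):
    rev = reverse(graph)
    # Undirected view: each edge can be followed in either direction.
    undirected = {n: list(graph.get(n, ())) + rev[n] for n in rev}
    components, assigned = [], set()
    for node in sorted(undirected):
        if node not in assigned:
            comp = reachable(undirected, node)
            components.append(comp)
            assigned |= comp
    return components

# Hypothetical graph: a 3-cycle A -> B -> C -> A plus a one-way edge C -> D.
web = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}
print(strongly_connected_components(web))
print(weakly_connected_components(web))
```

In the example, D sits in its own SCC because nothing leads back from it, yet it shares a single WCC with A, B and C.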
Popularity of Web Pages
• How do we expect the popularity of web pages to be distributed?
– What fraction of web pages have k in-links?
– If each page decides independently at random whether to link to any given other page, then the number of in-links of a page is the sum of independent random quantities -> normal distribution
– In this case, the number of pages with k in-links decreases exponentially in k
– Is this true for the Web?
Degree distribution for the Web
• Finding: the degree distribution is proportional to ~$1/k^2$
• $1/k^2$ decreases much more slowly than a normal distribution
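To get a feel for how much more slowly $1/k^2$ decays, the sketch below compares it with the in-link distribution expected under independent random linking. Two assumptions are made purely for illustration: the random-linking case is modelled with a Poisson distribution, and the mean of 10 in-links is a made-up value.

```python
import math

def power_law_fraction(k):
    """Fraction of pages with k in-links if p_k = C / k^2 for k >= 1."""
    # C normalises the distribution: sum over k >= 1 of 1/k^2 is pi^2 / 6.
    return (6 / math.pi ** 2) / k ** 2

def poisson_fraction(k, mean):
    """Fraction of pages with k in-links under independent random linking."""
    # Computed in log space to avoid overflow in mean**k and k!.
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

for k in (10, 100, 1000):
    print(k, power_law_fraction(k), poisson_fraction(k, mean=10))
```

Near the mean the two values are comparable, but for large k the random-linking value collapses towards zero while the power law still assigns a visible fraction of pages.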
Diameter of the Web
• 75% of the time there is no directed path between two random nodes
• Average distance of existing paths: 16
• Average distance of undirected paths: 6.83
• Diameter in the SCC is at least 28
Power Laws aka Scale Free Networks
• We have seen that the degree distribution followed a straight line in log-log
• α defines the slope of the curve
• α is typically between 2 and 3.
$$\ln p_k = -\alpha \ln k + c$$
$$p_k = C k^{-\alpha}, \qquad C = e^{c}$$
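Since $\ln p_k$ is linear in $\ln k$, a rough estimate of α can be read off as the negated slope of a straight-line fit in log-log space. The sketch below does this with ordinary least squares on synthetic data; `fit_power_law` is a hypothetical helper, and the values 2.1 and 0.5 are made up. Clauset, Shalizi and Newman (see References) argue that this kind of regression can be unreliable on real data and recommend maximum-likelihood estimation, so treat this only as an illustration of the formula.

```python
import math

def fit_power_law(ks, pks):
    """Least-squares line through (ln k, ln p_k); returns (alpha, C)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in pks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # ln p_k = -alpha ln k + c, so alpha = -slope and C = e^c.
    return -slope, math.exp(intercept)

# Synthetic, noise-free data following p_k = 0.5 * k^(-2.1).
ks = list(range(1, 50))
pks = [0.5 * k ** -2.1 for k in ks]
alpha, C = fit_power_law(ks, pks)
print(alpha, C)
```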
What’s a good model for scale free networks?
• Let’s use the web network as an example:
• Pages are created in order (1, 2, 3, ...)
• Page j is created and it links to an earlier page in the following way:
– With prob. p, j chooses a page i at random and links to it;
– With prob. 1-p, j chooses a page i at random and links to the page i points to.
– Repeat.
• The middle step is essentially a copy of node i’s behaviour…
Preferential attachment
• Pages are created in order (1, 2, 3, ...)
• Page j is created and it links to an earlier page in the following way:
– With prob. p, j chooses a page i at random and links to it;
– With prob. 1-p, j chooses a page z with prob. proportional to z’s current number of in-links and links to z (i.e. proportional to degree).
– Repeat.
Rich-get-richer model
If we run this for many pages, the fraction of pages with k in-links will be distributed approximately according to a power law $1/k^c$, where c depends on p
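The model can be simulated in a few lines. The sketch below is one possible reading of the rules, with some assumptions stated plainly: each new page creates exactly one link, the first page creates none, and while no links exist yet the choice falls back to a uniform one. The function name and parameter values are illustrative.

```python
import random
from collections import Counter

def preferential_attachment(n_pages, p, seed=0):
    """Return a list whose entry i is the in-link count of page i + 1."""
    rng = random.Random(seed)
    in_links = [0] * n_pages
    link_targets = []  # one entry per existing link: the page it points to
    for j in range(1, n_pages):  # 0-based index; the first page links nowhere
        if not link_targets or rng.random() < p:
            # With prob. p: uniform choice among earlier pages.
            target = rng.randrange(j)
        else:
            # With prob. 1-p: picking a uniformly random existing link and
            # taking its target selects page z with probability
            # proportional to z's current number of in-links.
            target = rng.choice(link_targets)
        in_links[target] += 1
        link_targets.append(target)
    return in_links

in_links = preferential_attachment(n_pages=20000, p=0.5)
counts = Counter(in_links)
for k in (0, 1, 2, 4, 8, 16):
    print(k, counts[k] / len(in_links))
```

Plotting the printed fractions against k on log-log axes should give a roughly straight line for larger k; lowering p strengthens the rich-get-richer step and makes the tail heavier.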
Intuition
• With probability 1-p, page j chooses a page i with probability proportional to i’s number of in-links and creates a link to i.
• This mechanism predicts that the growth happens so that
– A page’s popularity grows at a rate proportional to its current value.
– The rich-get-richer effect amplifies the larger values
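How the exponent c depends on p can be made concrete with a deterministic approximation of the model. This is a heuristic sketch, not a proof: time is treated as continuous, and the symbols $x_j(t)$ (approximate number of in-links of page j once t pages exist), $q = 1-p$ and $F(k)$ (fraction of pages with at least k in-links) are introduced here for the purpose. When page t arrives, page j gains a link with probability about $p/t$ from the uniform choice plus $q\,x_j/t$ from the proportional choice, since roughly t links exist in total.

```latex
\frac{dx_j}{dt} = \frac{p}{t} + \frac{q\,x_j}{t}, \qquad x_j(j) = 0
\quad\Longrightarrow\quad
x_j(t) = \frac{p}{q}\left[\left(\frac{t}{j}\right)^{q} - 1\right]

x_j(t) \ge k \iff j \le t\left[\frac{q}{p}\,k + 1\right]^{-1/q}
\quad\Longrightarrow\quad
F(k) = \left[\frac{q}{p}\,k + 1\right]^{-1/q}

-\frac{dF}{dk} = \frac{1}{p}\left[\frac{q}{p}\,k + 1\right]^{-\left(1 + \frac{1}{q}\right)}
\quad\Longrightarrow\quad
c = 1 + \frac{1}{1-p}
```

Under this approximation, p = 0.5 gives c = 3, and c approaches 2 as p approaches 0, that is, as the rich-get-richer step dominates.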
Preferential Attachment
• What have we shown?
• There is a “copying” behaviour happening in these networks, where nodes seem to emulate other nodes.
• This has been shown to hold for the selection of books, songs, web pages, movies, etc.
How predictable is the rich-get-richer process?
• Is the popularity of items in the power law predictable?
• Would a popular book still be popular if we go back in time and start the process again?
• Experiments show it would not…
Let’s transform the function
• If the initial function is a power law, this one is too (we do not prove this)
[Figure: plot with labels “Sale ranking”, “Niche tastes” and “Popularity means this”]
Search
– Information retrieval problem: synonyms (jump/leap), polysemy (Leopard), etc.
– Now with the web: diversity in authoring introduces issues of common criteria for ranking documents
– The web offers an abundance of information: whom do we trust as a source?
• Still one issue: static content versus real time
– World Trade Center query on 11/9/01
– Twitter helps solve these issues these days
Automate the Search
• When searching “Computer Laboratory” on Google the first link is for the department’s page.
• How does Google know this is the best answer?
• We could collect a large sample of pages relevant to “computer laboratory” and collect their votes through their links.
• The pages receiving more in-links are ranked first.
• But if we use the network structure more deeply we can improve results.
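The simple voting scheme, counting in-links and ranking by the count, can be sketched as follows. The page names and links are invented for illustration, and `rank_by_in_links` is a hypothetical helper.

```python
from collections import Counter

def rank_by_in_links(links):
    """Rank pages by the number of in-links (votes) they receive.

    links is a list of (source_page, target_page) pairs.
    """
    votes = Counter(target for _source, target in links)
    # most_common() sorts by vote count, highest first.
    return [page for page, _count in votes.most_common()]

# Hypothetical sample of pages relevant to the query and their links.
links = [
    ("p1", "dept"), ("p2", "dept"), ("p3", "dept"),
    ("p1", "wiki"), ("p2", "wiki"),
    ("p3", "blog"),
]
print(rank_by_in_links(links))  # ['dept', 'wiki', 'blog']
```

Every vote counts equally here, regardless of who casts it; weighting votes by the quality of the voter is what using the network structure more deeply adds.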
Example: Query “newspaper”
Authorities
• Links are seen as votes.
• Authorities are established: the highly endorsed pages
Repeating and Normalizing
• The process can be repeated
• Normalization:
– Each authority score is divided by the sum of all authority scores
– Each hub score is divided by the sum of all hub scores
More Formally: does the process converge?
• Each page has an authority score $a_i$ and a hub score $h_i$
• Initially $a_i = h_i = 1$
• At each step:
$$a_i = \sum_{j:\, j \to i} h_j$$
$$h_j = \sum_{i:\, j \to i} a_i$$
• Normalize:
$$\sum_i a_i = 1$$
$$\sum_j h_j = 1$$
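The update and normalization steps can be sketched as the loop below. Two choices are assumptions rather than something the slides state: the hub update uses the freshly computed authority scores from the same step, and the graph is assumed to contain at least one link (otherwise the normalization would divide by zero). The graph and the name `hubs_and_authorities` are illustrative.

```python
def hubs_and_authorities(graph, steps=20):
    """Iterate the hub/authority updates; returns (authority, hub) dicts."""
    nodes = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(steps):
        # a_i = sum of h_j over pages j that link to i.
        new_auth = {n: 0.0 for n in nodes}
        for j, targets in graph.items():
            for i in targets:
                new_auth[i] += hub[j]
        # h_j = sum of a_i over pages i that j links to
        # (assumption: uses the authority scores just computed).
        new_hub = {j: sum(new_auth[i] for i in graph.get(j, ()))
                   for j in nodes}
        # Normalize so that each set of scores sums to 1.
        auth_total = sum(new_auth.values())
        hub_total = sum(new_hub.values())
        auth = {n: v / auth_total for n, v in new_auth.items()}
        hub = {n: v / hub_total for n, v in new_hub.items()}
    return auth, hub

# Hypothetical example: two list pages pointing at two candidate authorities.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
auth, hub = hubs_and_authorities(graph)
print(auth)
print(hub)
```

In this example a1 ends up with the higher authority score (both list pages endorse it) and h1 with the higher hub score (it points to both authorities).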
PageRank
• We have seen hubs and authorities
– Hubs can “collect” links to important authorities who do not point to each other
– There are other models: better for the web, where one prominent page can endorse another.
• The PageRank model is based on transferable importance.
PageRank Concepts
• Pages pass endorsements on outgoing links as fractions which depend on out-degree
• Initial PageRank value of each node in a network of n nodes: 1/n.
• Choose a number of steps k.
• [Basic] Update rule: each page divides its PageRank equally over the outgoing links and passes an equal share to the pointed pages. Each page’s new rank is the sum of received PageRanks.
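A minimal sketch of the basic update rule. One detail is an assumption, since the slide does not say what happens to pages without outgoing links: here such a page keeps its own PageRank. Every linked page is assumed to appear as a key of `graph`; the graph itself and the name `basic_pagerank` are illustrative.

```python
def basic_pagerank(graph, steps):
    """Apply the basic PageRank update rule `steps` times."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}  # initial value 1/n
    for _ in range(steps):
        new_rank = {node: 0.0 for node in graph}
        for node, targets in graph.items():
            if targets:
                # Divide the current rank equally over outgoing links.
                share = rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links keeps its rank.
                new_rank[node] += rank[node]
        rank = new_rank
    return rank

# Hypothetical three-page web.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(basic_pagerank(web, steps=1))    # A: 1/3, B: 1/6, C: 1/2
print(basic_pagerank(web, steps=100))  # approaches A: 0.4, B: 0.2, C: 0.4
```

The total PageRank stays at 1 after every step, because each page only redistributes what it currently holds.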
Convergence
• Except for some special cases, the PageRank values of all nodes converge to limiting values as the number of steps goes to infinity.
• The convergence case is one where the PageRank of each page does not change anymore, i.e., the values regenerate themselves.
Solution: The REAL PageRank
• [Scaled] Update Rule:
– Apply the basic update rule. Then, scale down all values by a scaling factor s [chosen between 0 and 1].
– [Total network PageRank value changes from 1 to s]
– Divide the 1-s residual units of PageRank equally over all nodes: (1-s)/n each.
• It can be proven that values converge again.
• The scaling factor is usually chosen between 0.8 and 0.9
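The scaled rule adds two lines to the basic one. The sketch below is self-contained and carries the same assumption as before for pages without outgoing links (they keep their rank in the basic step). The value s = 0.85, the example graph and the name `scaled_pagerank` are illustrative choices.

```python
def scaled_pagerank(graph, s=0.85, steps=100):
    """Apply the scaled PageRank update rule `steps` times."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(steps):
        # Basic update rule.
        new_rank = {node: 0.0 for node in graph}
        for node, targets in graph.items():
            if targets:
                share = rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links keeps its rank.
                new_rank[node] += rank[node]
        # Scale everything down by s, then hand (1-s)/n to every node.
        rank = {node: s * value + (1 - s) / n
                for node, value in new_rank.items()}
    return rank

# Hypothetical chain ending in a page that only links to itself.
trap = {"A": ["B"], "B": ["C"], "C": ["C"]}
print(scaled_pagerank(trap))
```

Under the basic rule all PageRank in this example would drain into C. With the scaled rule every page retains at least (1-s)/n, so A and B keep a non-zero value.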
Search Ranking is very important to business
• A change in the results on the search pages might mean loss of business
– I.e., not appearing on the first page.
• Ranking algorithms are kept very secret and changed continuously.
PageRank as Random Walk
• The probability of being at a page X after k steps of a random walk is precisely the PageRank of X after k applications of the Basic PageRank Update Rule
• Scaled Update Rule equivalent: follow a random outgoing link with probability s, while with probability 1-s jump to a random node in the network.
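The equivalence can be checked empirically by simulating many walks and counting where they end. The sketch assumes each walk starts at a uniformly random page (matching the initial value 1/n) and that a walker on a page without outgoing links stays put; the number of walks, the seed and the example graph are arbitrary illustrative choices.

```python
import random
from collections import Counter

def walk_distribution(graph, k, s=1.0, walks=20000, seed=0):
    """Estimate the probability of being at each page after k steps.

    s = 1.0 gives the plain random walk; s < 1 adds random jumps.
    """
    rng = random.Random(seed)
    nodes = list(graph)
    counts = Counter()
    for _ in range(walks):
        node = rng.choice(nodes)  # uniform start, like the initial 1/n
        for _ in range(k):
            if rng.random() >= s:
                node = rng.choice(nodes)        # jump to a random node
            elif graph[node]:
                node = rng.choice(graph[node])  # follow a random out-link
            # Assumption: with no out-links the walker stays where it is.
        counts[node] += 1
    return {n: counts[n] / walks for n in nodes}

# Hypothetical three-page web. Two applications of the basic update rule
# give A: 1/2, B: 1/6, C: 1/3 on this graph.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(walk_distribution(web, k=2))
```

The estimates are noisy and only approach the PageRank values as the number of walks grows.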
References
• Chapters 13, 14 and 18
• Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the Web. In Proc. 9th International World Wide Web Conference, pages 309-320, 2000.
• A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review, Vol. 51, No. 4, 661, 2009.
• Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286:509-512, October 15, 1999.
• Matthew Salganik, Peter Dodds, and Duncan Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311:854-856, 2006.
Barabási’s book has a good chapter on scale free networks too!