Link Analysis {week 09}
The College of Saint Rose
CSC 460 / CIS 560 – Search and Information Retrieval
David Goldschmidt, Ph.D.
from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
Are you connected?
The Internet (1969) is a network that is:
- Global
- Decentralized
- Redundant
- Made up of many different types of machines

How many machines make up the Internet?
Browsing the Web
from Fluency with Information Technology, 4th edition by Lawrence Snyder, Addison-Wesley, 2010, ISBN 0-13-609182-2
The World Wide Web
Sir Tim Berners-Lee
Weaving the Web
The World Wide Web (or just Web) is:
- Global
- Decentralized
- Redundant (sometimes)
- Made up of Web pages and interactive Web services

How many Web pages are on the Web?
Links
- Links are useful to us humans for navigating Web sites and finding things
- Links are also useful to search engines

<a href="http://cnn.com">Latest News</a>

In this link, http://cnn.com is the destination link (URL) and “Latest News” is the anchor text.
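As a rough illustration of how a crawler might separate the two parts of a link, the sketch below uses Python's standard `html.parser` module to pull the destination URL and the anchor text out of the example tag. The class name `LinkExtractor` and the helper `extract_links` are made up for this example and do not come from the slides.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (destination URL, anchor text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (url, anchor_text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so the anchor text is one clean string.
            anchor_text = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor_text))
            self._href = None


def extract_links(html):
    """Return a list of (destination URL, anchor text) pairs found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


print(extract_links('<a href="http://cnn.com"> Latest News\n</a>'))
```

For the slide's example this yields the pair of `http://cnn.com` and `Latest News`. Anchors without an `href` attribute are skipped, which is a simplification chosen for this sketch.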
Anchor text
How does anchor text apply to ranking?
- Anchor text describes the content of the destination page
- Anchor text is short, descriptive, and often coincides with query text
- Anchor text is typically written by a non-biased third party
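One way to picture anchor text feeding into ranking is to file each link's anchor text under its destination page and then count query-term matches. The sketch below does only that. The link triples, the `.example` hosts, and the function names are hypothetical, and a real engine would weight these matches rather than just count them.

```python
from collections import defaultdict

# Hypothetical (source page, destination URL, anchor text) triples.
links = [
    ("a.example", "http://cnn.com", "Latest News"),
    ("b.example", "http://cnn.com", "breaking news headlines"),
    ("c.example", "http://weather.example", "local weather forecast"),
]


def build_anchor_index(links):
    """Map each destination URL to the anchor-text terms that point at it."""
    index = defaultdict(list)
    for _source, dest, text in links:
        index[dest].extend(text.lower().split())
    return index


def anchor_score(index, url, query):
    """Count occurrences of the query's terms in the anchor text for url."""
    terms = index.get(url, [])
    return sum(terms.count(term) for term in query.lower().split())


anchor_index = build_anchor_index(links)
print(anchor_score(anchor_index, "http://cnn.com", "news"))
print(anchor_score(anchor_index, "http://weather.example", "news"))
```

Here the query "news" matches the anchor text written by two different third parties for http://cnn.com and none of the text pointing at the weather page.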
The Web as a graph (i)
We often represent Web pages as vertices and links as edges in a webgraph
http://www.openarchives.org/ore/0.1/datamodel-images/WebGraphBase.jpg
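A webgraph like this can be held in memory as an adjacency mapping from each page to the pages it links to. The three pages A, B, and C below are a hypothetical example, not the graph in the figure.

```python
# Hypothetical three-page webgraph: each key is a page (vertex) and each
# value is the set of pages it links to (directed edges).
webgraph = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
}


def edges(graph):
    """Return the directed edges as sorted (source, destination) pairs."""
    return sorted((src, dst) for src, dsts in graph.items() for dst in dsts)


print(edges(webgraph))
```

Using a set for each page's outgoing links means duplicate links between the same two pages collapse into a single edge.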
The Web as a graph (ii)
An example:
http://www.growyourwritingbusiness.com/images/web_graph_flower.jpg
Using webgraphs for ranking
Links may be interpreted as describing a destination Web page in terms of its:
- Popularity
- Importance
- Authority
- Incoming link count

We focus on incoming links (inlinks):
- We use these for ranking matching documents
- The drawback is obtaining incoming link data
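The drawback mentioned above shows up as soon as you try to count inlinks: a page only lists its own outgoing links, so finding who points at it takes a pass over every page. The sketch below inverts a small hypothetical graph (pages A, B, and C are assumptions) and ranks pages by inlink count.

```python
def count_inlinks(graph):
    """Return {page: number of distinct pages linking to it}."""
    inlinks = {page: set() for page in graph}
    # Every page must be visited, because inlinks are only recorded
    # on the pages that contain the links.
    for source, destinations in graph.items():
        for dest in destinations:
            inlinks.setdefault(dest, set()).add(source)
    return {page: len(sources) for page, sources in inlinks.items()}


# Hypothetical webgraph: page -> set of pages it links to.
webgraph = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}

counts = count_inlinks(webgraph)
# Highest inlink count first; ties are broken alphabetically.
ranked = sorted(counts, key=lambda page: (-counts[page], page))
print(counts)
print(ranked)
```

In this toy graph, C has two inlinks and comes first, while A and B have one each.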
PageRank (i)
- PageRank is a link analysis algorithm
- PageRank is credited to Sergey Brin and Lawrence Page (the Google guys!)
- The original PageRank paper: http://infolab.stanford.edu/~backrub/google.html
PageRank (ii)
Browse the Web as a random surfer:
- Choose a random number r between 0 and 1
- If r < λ, then go to a random page
- Otherwise, follow a random link from the current page
- Repeat!

The PageRank of page A (denoted PR(A)) is the probability that this “random surfer” will be looking at that page
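The random-surfer steps above can be simulated directly. This is a minimal sketch: the three-page graph, the step count, and the fixed seed are all assumptions, and the fraction of time spent on each page is only an estimate of its PageRank. The simulation also jumps whenever the current page has no outgoing links, a choice made here so the surfer never gets stuck.

```python
import random


def random_surfer(graph, lam=0.15, steps=100_000, seed=0):
    """Estimate PageRank as the fraction of steps spent on each page."""
    rng = random.Random(seed)      # fixed seed so runs are repeatable
    pages = sorted(graph)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        outlinks = sorted(graph[current])
        if rng.random() < lam or not outlinks:
            current = rng.choice(pages)       # r < lambda: go to a random page
        else:
            current = rng.choice(outlinks)    # otherwise follow a random link
    return {page: count / steps for page, count in visits.items()}


# Hypothetical webgraph: page -> set of pages it links to.
webgraph = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
estimate = random_surfer(webgraph)
print(estimate)
```

With this graph, B should come out clearly lowest, since its only inlink is one of A's two outgoing links.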
PageRank (iii)
Jumping to a random page avoids getting stuck in:
- Pages that have no links
- Pages that only have broken links
- Pages that loop back to previously visited pages
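To see what "getting stuck" means, the sketch below computes which pages a surfer can ever reach by following links alone, with no random jumps. The four-page graph is hypothetical: B and C link only to each other, and D has no links at all.

```python
def reachable(graph, start):
    """Pages a surfer can reach from start by following links only."""
    seen, frontier = set(), [start]
    while frontier:
        page = frontier.pop()
        for nxt in graph.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


# Hypothetical webgraph: B and C form a loop, D is a dead end.
graph = {"A": ["B"], "B": ["C"], "C": ["B"], "D": []}

print(sorted(reachable(graph, "B")))   # the loop: A and D are never seen
print(sorted(reachable(graph, "D")))   # the dead end: nowhere to go
```

Once the surfer enters the B and C loop, pages A and D can only be visited again through a random jump.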
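PageRank (iv)
The PageRank of page C is the probability that a random surfer is viewing page C. It is based on C's inlinks: each page that links to C contributes its own PageRank divided by its number of outgoing links (here A has two outgoing links and B has one).

PR(C) = PR(A) / 2 + PR(B) / 1

We start by assuming PageRank is distributed evenly across all pages (so 0.33 for each of A, B, and C):

PR(C) = 0.33 / 2 + 0.33 / 1 ≈ 0.50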
PageRank (v)
More generally:

$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L_v}$$

- $B_u$ is the set of pages that point to u
- $L_v$ is the number of outgoing links from page v (not counting duplicate links)
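Read as code, the general rule says that each page v hands an equal share of its PageRank to every distinct page it links to, and a page's new value is the sum of the shares it receives. The sketch below applies one such update with exact fractions. The graph is an assumption: the slides only fix that A has two outgoing links and B has one, so the link from C back to A is added here to make the example complete.

```python
from fractions import Fraction


def pagerank_step(graph, pr):
    """One update: each page's new value is the sum of PR(v) / L_v over its inlinks."""
    new = {u: Fraction(0) for u in graph}
    for v, outlinks in graph.items():
        targets = set(outlinks)            # duplicate links are not counted
        for u in targets:
            new[u] += pr[v] / len(targets)
    return new


# Assumed webgraph: A -> B, A -> C, B -> C, C -> A.
webgraph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
start = {page: Fraction(1, 3) for page in webgraph}   # evenly distributed

after_one_step = pagerank_step(webgraph, start)
print(after_one_step["C"])
```

Starting from one third on every page, the value for C works out to exactly one half, matching the PR(C) calculation for pages A, B, and C.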
PageRank (vi)
We can account for the “random jumps” by incorporating the constant λ into the equation:

$$PR(u) = \frac{\lambda}{N} + (1 - \lambda) \sum_{v \in B_u} \frac{PR(v)}{L_v}$$

- Typically, λ is low (e.g. λ = 0.15)
- N is the number of pages
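A minimal iterative sketch of PageRank with random jumps follows, assuming that with probability λ the surfer lands on any of the N pages uniformly and otherwise follows a random outgoing link. The graph, the iteration count, and the treatment of pages with no outgoing links (their PageRank is spread evenly over all pages) are assumptions of this example. Every link target is also assumed to appear as a key in the graph.

```python
def pagerank(graph, lam=0.15, iterations=50):
    """Repeatedly apply PR(u) = lam / N + (1 - lam) * sum of PR(v) / L_v."""
    pages = sorted(graph)
    n = len(pages)
    pr = {page: 1.0 / n for page in pages}          # start evenly distributed
    for _ in range(iterations):
        new = {page: lam / n for page in pages}     # random-jump share
        for v in pages:
            # Assumption: a page with no outgoing links shares with every page.
            targets = set(graph[v]) or set(pages)
            share = (1 - lam) * pr[v] / len(targets)
            for u in targets:
                new[u] += share
        pr = new
    return pr


# Assumed webgraph: A -> B, A -> C, B -> C, C -> A.
webgraph = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
ranks = pagerank(webgraph)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

For this graph the values settle near 0.39 for A, 0.21 for B, and 0.40 for C, and they sum to 1 because each update redistributes the same total.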
Link quality (and avoiding spam)
A cycle tends to negate the effectiveness of the PageRank algorithm
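One reading of this point is that pages linking only to each other keep passing PageRank back and forth, so a small cycle can soak up far more than its share. The sketch below illustrates that under assumptions: the four-page graph is hypothetical, S1 and S2 stand in for spam pages, and the update uses random jumps with λ = 0.15.

```python
def pagerank(graph, lam=0.15, iterations=100):
    """PageRank with random jumps; assumes every page has an outgoing link."""
    n = len(graph)
    pr = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new = {page: lam / n for page in graph}
        for v, outlinks in graph.items():
            for u in outlinks:
                new[u] += (1 - lam) * pr[v] / len(outlinks)
        pr = new
    return pr


# Hypothetical webgraph: A and B are ordinary pages; S1 and S2 link only
# to each other and receive a single link from B.
graph = {
    "A": {"B"},
    "B": {"A", "S1"},
    "S1": {"S2"},
    "S2": {"S1"},
}

pr = pagerank(graph)
cycle_share = pr["S1"] + pr["S2"]
print(round(cycle_share, 2))
```

In this toy graph the two cycle pages end up holding roughly four fifths of all PageRank even though only one outside link points into the cycle.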
What next?
Read and study Chapter 4.5