diapo
TRANSCRIPT
![Page 1: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/1.jpg)
1
Importance of Web Pages
PageRank
Topic-Specific PageRank
Hubs and Authorities
Combatting Link Spam
![Page 2: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/2.jpg)
2
PageRank
�Intuition: solve the recursive equation: “a page is important if important pages link to it.”
�In high-falutin’ terms: importance = the principal eigenvector of the transition matrix of the Web.
� A few fixups needed.
![Page 3: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/3.jpg)
3
Transition Matrix of the Web
� Enumerate pages.
� Page i corresponds to row and column i.
� M [i, j ] = 1/n if page j links to n pages, including page i ; 0 if j does not link to i.
� M [i, j ] is the probability we’ll next be at page i if we are now at page j.
![Page 4: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/4.jpg)
4
Example: Transition Matrix
i
j
Suppose page j links to 3 pages, including i but not x.
1/3x
0
![Page 5: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/5.jpg)
5
Random Walks on the Web
�Suppose v is a vector whose i th
component is the probability that each random walker is at page i at a certain time.
�If each walker follows a link from i at random, the probability distribution for walkers is then given by the vector M v.
![Page 6: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/6.jpg)
6
Random Walks – (2)
�Starting from any vector v, the limit M (M (…M (M v ) …)) is the long-term distribution of walkers.
�Intuition: pages are important in proportion to how likely a walker is to be there.
�The math: limiting distribution = principal eigenvector of M = PageRank.
![Page 7: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/7.jpg)
7
Example: The Web in 1839
Yahoo
M’softAmazon
y 1/2 1/2 0a 1/2 0 1m 0 1/2 0
y a m
![Page 8: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/8.jpg)
8
Solving The Equations
�Because there are no constant terms, the equations v = M v do not have a unique solution.
�In Web-sized examples, we cannot solve by Gaussian elimination anyway; we need to use relaxation (= iterative solution).
�Can work if you start with a fixed v.
![Page 9: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/9.jpg)
9
Simulating a Random Walk
�Start with the vector v = [1, 1,…, 1] representing the idea that each Web page is given one unit of importance.
�Repeatedly apply the matrix M to v, allowing the importance to flow like a random walk.
�About 50 iterations is sufficient to estimate the limiting solution.
![Page 10: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/10.jpg)
10
Example: Iterating Equations
�Equations v = M v :
y = y /2 + a /2
a = y /2 + m
m = a /2
ya =m
111
13/21/2
5/41
3/4
9/811/81/2
6/56/53/5
. . .
Note: “=” isreally “assignment.”
![Page 11: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/11.jpg)
11
The Walkers
Yahoo
M’softAmazon
![Page 12: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/12.jpg)
12
The Walkers
Yahoo
M’softAmazon
![Page 13: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/13.jpg)
13
The Walkers
Yahoo
M’softAmazon
![Page 14: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/14.jpg)
14
The Walkers
Yahoo
M’softAmazon
![Page 15: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/15.jpg)
15
In the Limit …
Yahoo
M’softAmazon
![Page 16: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/16.jpg)
16
Real-World Problems
�Some pages are “dead ends” (have no links out).
� Such a page causes importance to leak out.
�Other groups of pages are spider traps(all out-links are within the group).
� Eventually spider traps absorb all importance.
![Page 17: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/17.jpg)
17
Microsoft Becomes Dead End
Yahoo
M’softAmazon
y 1/2 1/2 0a 1/2 0 0m 0 1/2 0
y a m
![Page 18: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/18.jpg)
18
Example: Effect of Dead Ends
�Equations v = M v :
y = y /2 + a /2
a = y /2
m = a /2
ya =m
111
11/21/2
3/41/21/4
5/83/81/4
000
. . .
![Page 19: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/19.jpg)
19
Microsoft Becomes a Dead End
Yahoo
M’softAmazon
![Page 20: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/20.jpg)
20
Microsoft Becomes a Dead End
Yahoo
M’softAmazon
![Page 21: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/21.jpg)
21
Microsoft Becomes a Dead End
Yahoo
M’softAmazon
![Page 22: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/22.jpg)
22
Microsoft Becomes a Dead End
Yahoo
M’softAmazon
![Page 23: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/23.jpg)
23
In the Limit …
Yahoo
M’softAmazon
![Page 24: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/24.jpg)
24
M’soft Becomes Spider Trap
Yahoo
M’softAmazon
y 1/2 1/2 0a 1/2 0 0m 0 1/2 1
y a m
![Page 25: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/25.jpg)
25
Example: Effect of Spider Trap
�Equations v = M v :
y = y /2 + a /2
a = y /2
m = a /2 + m
ya =m
111
11/23/2
3/41/27/4
5/83/82
003
. . .
![Page 26: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/26.jpg)
26
Microsoft Becomes a Spider Trap
Yahoo
M’softAmazon
![Page 27: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/27.jpg)
27
Microsoft Becomes a Spider Trap
Yahoo
M’softAmazon
![Page 28: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/28.jpg)
28
Microsoft Becomes a Spider Trap
Yahoo
M’softAmazon
![Page 29: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/29.jpg)
29
In the Limit …
Yahoo
M’softAmazon
![Page 30: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/30.jpg)
30
PageRank Solution to Traps, Etc.
�“Tax” each page a fixed percentage at each interation.
�Add a fixed constant to all pages.
�Models a random walk with a fixed probability of leaving the system, and a fixed number of new walkers injected into the system at each step.
![Page 31: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/31.jpg)
31
Example: Microsoft is a Spider Trap; 20% Tax
�Equations v = 0.8(M v ) + 0.2:
y = 0.8(y /2 + a/2) + 0.2
a = 0.8(y /2) + 0.2
m = 0.8(a /2 + m) + 0.2
ya =m
111
1.000.601.40
0.840.601.56
0.7760.5361.688
7/115/11
21/11
. . .
![Page 32: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/32.jpg)
32
General Case
�In this example, because there are no dead-ends, the total importance remains at 3.
�In examples with dead-ends, some importance leaks out, but total remains finite.
![Page 33: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/33.jpg)
33
Solving the Equations
�Because there are constant terms, we can expect to solve small examples by Gaussian elimination.
�Web-sized examples still need to be solved by relaxation.
![Page 34: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/34.jpg)
34
Finding a Good Starting Vector
1. Newton-like prediction of where components of the principal eigenvector are heading.
2. Take advantage of locality in the Web.
� Each technique can reduce the number of iterations by 50%.
� Important – PageRank takes time!
![Page 35: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/35.jpg)
35
Predicting Component Values
�Three consecutive values for the importance of a page suggests where the limit might be.
1.0
0.70.6 0.55
Guess for the next round
![Page 36: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/36.jpg)
36
Exploiting Substructure
�Pages from particular domains, hosts, or directories, like stanford.edu or infolab.stanford.edu/~ullman
tend to have many internal links.
�Initialize PageRank using ranks within your local cluster, then ranking the clusters themselves.
![Page 37: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/37.jpg)
37
Strategy
�Compute local PageRanks (in parallel?).
�Use local weights to establish weights on edges between clusters.
�Compute PageRank on graph of clusters.
�Initial rank of a page is the product of its local rank and the rank of its cluster.
�“Clusters” are appropriately sized regions with common domain or lower-level detail.
![Page 38: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/38.jpg)
38
In Pictures
2.0
0.1
Local ranks
2.05
0.05Intercluster weights
Ranks of clusters
1.5
Initial eigenvector
3.0
0.15
![Page 39: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/39.jpg)
39
Topic-Specific Page Rank
�Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. “sports” or “history.”
�Allows search queries to be answered based on interests of the user.
� Example: Query Maccabi wants different
pages depending on whether you are interested in sports or history.
![Page 40: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/40.jpg)
40
Teleport Sets
� Assume each walker has a small probability of “teleporting” at any tick.
� Teleport can go to:
1. Any page with equal probability.
� As in the “taxation” scheme.
2. A topic-specific set of “relevant” pages (teleport set ).
� For topic-specific PageRank.
![Page 41: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/41.jpg)
41
Example: Topic = Software
�Only Microsoft is in the teleport set.
�Assume 20% “tax.”
� I.e., probability of a teleport is 20%.
![Page 42: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/42.jpg)
42
Only Microsoft in Teleport Set
Yahoo
M’softAmazon
Dr. Who’sphonebooth.
![Page 43: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/43.jpg)
43
Only Microsoft in Teleport Set
Yahoo
M’softAmazon
![Page 44: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/44.jpg)
44
Only Microsoft in Teleport Set
Yahoo
M’softAmazon
![Page 45: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/45.jpg)
45
Only Microsoft in Teleport Set
Yahoo
M’softAmazon
![Page 46: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/46.jpg)
46
Only Microsoft in Teleport Set
Yahoo
M’softAmazon
![Page 47: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/47.jpg)
47
Only Microsoft in Teleport Set
Yahoo
M’softAmazon
![Page 48: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/48.jpg)
48
Only Microsoft in Teleport Set
Yahoo
M’softAmazon
![Page 49: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/49.jpg)
49
Picking the Teleport Set
1. Choose the pages belonging to the topic in Open Directory.
2. “Learn” from examples the typical words in pages belonging to the topic; use pages heavy in those words as the teleport set.
![Page 50: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/50.jpg)
50
Application: Link Spam
�Spam farmers today create networks of millions of pages designed to focus PageRank on a few undeserving pages.
�To minimize their influence, use a teleport set consisting of trusted pages only.
� Example: home pages of universities.
![Page 51: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/51.jpg)
51
Hubs and Authorities
�Mutually recursive definition:
� A hub links to many authorities;
� An authority is linked to by many hubs.
�Authorities turn out to be places where information can be found.
� Example: course home pages.
�Hubs tell where the authorities are.
� Example: CS Dept. course-listing page.
![Page 52: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/52.jpg)
52
Transition Matrix A
�H&A uses a matrix A [i, j ] = 1 if page ilinks to page j, 0 if not.
�AT, the transpose of A, is similar to the PageRank matrix M, but AT has 1’s where M has fractions.
![Page 53: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/53.jpg)
53
Example: H&A Transition Matrix
Yahoo
M’softAmazon
y 1 1 1a 1 0 1m 0 1 0
y a m
A =
![Page 54: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/54.jpg)
54
Using Matrix A for H&A
�Powers of A and AT have elements of exponential size, so we need scale factors.
�Let h and a be vectors measuring the “hubbiness” and authority of each page.
�Equations: h = λAa; a = µAT h.
� Hubbiness = scaled sum of authorities of successor pages (out-links).
� Authority = scaled sum of hubbiness of predecessor pages (in-links).
![Page 55: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/55.jpg)
55
Consequences of Basic Equations
�From h = λAa; a = µAT h we can derive:� h = λµAAT h
� a = λµATA a
�Compute h and a by iteration, assuming initially each page has one unit of hubbiness and one unit of authority.� Pick an appropriate value of λµ.
![Page 56: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/56.jpg)
56
Example: Iterating H&A
1 1 1A = 1 0 1
0 1 0
1 1 0AT = 1 0 1
1 1 0
3 2 1AAT= 2 2 0
1 0 1
2 1 2ATA= 1 2 1
2 1 2
a(yahoo)a(amazon)a(m’soft)
===
111
545
241824
11484
114
. . .
. . .
. . .
1+√321+√3
h(yahoo) = 1h(amazon) = 1h(m’soft) = 1
642
1329636
. . .
. . .
. . .
1.0000.7350.268
2820
8
![Page 57: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/57.jpg)
57
Solving H&A in Practice
�Iterate as for PageRank; don’t try to solve equations.
�But keep components within bounds.
� Example: scale to keep the largest component of the vector at 1.
�Trick: start with h = [1,1,…,1]; multiply by AT to get first a; scale, then multiply by A to get next h,…
![Page 58: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/58.jpg)
58
Solving H&A – (2)
�You may be tempted to compute AAT
and ATA first, then iterate these matrices as for PageRank.
�Bad, because these matrices are not nearly as sparse as A and AT.
![Page 59: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/59.jpg)
59
H&A Versus PageRank
�If you talk to someone from IBM, they may tell you “IBM invented PageRank.”
� What they mean is that H&A was invented by Jon Kleinberg when he was at IBM.
�But these are not the same.
�H&A does not appear to be a substitute for PageRank.
�But may be used by Ask.com.
![Page 60: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/60.jpg)
60
Spam on the Web
�Search has become the default gateway to the web.
�Very high premium to appear on the first page of search results.
![Page 61: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/61.jpg)
61
What is Web Spam?
� Spamming = any action whose purpose is to boost a web page’s position in search engine results, without providing additional value.
� Spam = Web pages used for spamming.
�Approximately 10-15% of Web pages are spam.
![Page 62: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/62.jpg)
62
Web-Spam Taxonomy
�Boosting techniques :
� Techniques for increasing the probability a Web page will be a highly ranked answer to a search query.
�Hiding techniques :
� Techniques to hide the use of boosting from humans and Web crawlers.
![Page 63: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/63.jpg)
63
Hiding techniques
�Content hiding.
� Use same color for text and page background.
�Cloaking.
� Return different page to crawlers and browsers.
�Redirection.
� Redirects are followed by browsers but not crawlers.
![Page 64: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/64.jpg)
64
Boosting Techniques
�Term spamming :
� Manipulating the text of Web pages in order to appear relevant to queries.
• Why? You can run ads that are relevant to the query.
�Link spamming :
� Creating link structures that boost PageRank.
![Page 65: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/65.jpg)
65
Term Spamming – (1)
�Repetition : of one or a few specific terms e.g., “free,” “cheap,” “Viagra.”
�Dumping of a large number of unrelated terms.
� E.g., copy entire dictionaries.
![Page 66: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/66.jpg)
66
Term Spamming – (2)
�Weaving :
� Copy legitimate pages and insert spam terms at random positions.
� Phrase Stitching :
� Glue together sentences and phrases from different sources.
• E.g., use the top-ranked pages on the topic you want to look like.
![Page 67: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/67.jpg)
67
The Google Solution to Term Spamming
�In addition to PageRank, the original Google engine had another innovation: it trusted what people said about you in preference to what you said about yourself.
�Give more weight to words that appear in or near anchor text than to words that appear in the page itself.
![Page 68: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/68.jpg)
68
The Google Solution – (2)
�Today, the Google formula for matching terms to documents involves over 250 factors.
� E.g., does the word appear in a header?
�As closely guarded as the formula for Coke.
![Page 69: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/69.jpg)
69
Link Spam
� Three kinds of Web pages from a spammer’s point of view:
1. Own pages.
• Completely controlled by spammer.
2. Accessible pages.
• E.g., Web-log comment pages: spammer can post links to his pages.
3. Inaccessible pages.
![Page 70: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/70.jpg)
70
Spam Farms – (1)
� Spammer’s goal:
� Maximize the PageRank of target page t.
� Technique:
1. Get as many links from accessible pages as possible to target page t.
2. Construct “link farm” to get PageRankmultiplier effect.
![Page 71: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/71.jpg)
71
Spam Farms – (2)
InaccessibleInaccessible
t
Accessible Own
1
2
M
Goal: boost PageRank of page t.One of the most common and effectiveorganizations for a spam farm.
![Page 72: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/72.jpg)
72
Analysis – (1)
Suppose rank from accessible pages = x.
PageRank of target page = y.
Taxation rate = 1-β.
Rank of each “farm” page = βy/M + (1-β)/N.
InaccessibleInaccessiblet
Accessible Own
12
M
From t
Share of“tax”
Size ofWeb
![Page 73: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/73.jpg)
73
Analysis – (2)
y = x + βM[βy/M + (1-β)/N] + (1-β)/N
y = x + β2y + β(1-β)M/N
y = x/(1-β2) + cM/N where c = β/(1+β)
InaccessibleInaccessiblet
Accessible Own
12
M
Tax sharefor t.Very small;ignore.
PageRank ofeach “farm” page
![Page 74: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/74.jpg)
74
Analysis – (3)
�y = x/(1-β2) + cM/N where c = β/(1+β).
�For β = 0.85, 1/(1-β2)= 3.6.
� Multiplier effect for “acquired” page rank.
�By making M large, we can make y as large as we want.
InaccessibleInaccessiblet
Accessible Own
12
M
![Page 75: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/75.jpg)
75
Detecting Link-Spam
�Topic-specific PageRank, with a set of “trusted” pages as the teleport set is called TrustRank.
�Spam Mass = (PageRank – TrustRank)/PageRank.
� High spam mass means most of your PageRank comes from untrusted sources –you may be link-spam.
![Page 76: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/76.jpg)
76
Picking the Trusted Set
�Two conflicting considerations:
� Human has to inspect each seed page, so seed set must be as small as possible.
� Must ensure every “good page” gets adequate TrustRank, so all good pages should be reachable from the trusted set by short paths.
![Page 77: Diapo](https://reader033.vdocuments.net/reader033/viewer/2022052621/55802659d8b42aac768b4beb/html5/thumbnails/77.jpg)
77
Approaches to Picking the Trusted Set
1. Pick the top k pages by PageRank.
� It is almost impossible to get a spam page to the very top of the PageRank order.
2. Pick the home pages of universities.
� Domains like .edu are controlled.