search engines and google - universitetet i oslo · search engines • search engine queries are...

Search Engines and Google

Francisco Velázquez, 3 Nov. 2010


Motivation

• Human maintained lists are subjective, expensive to build and maintain, slow to improve and cannot cover all esoteric topics.

• Automated search engines that rely on keyword matching return low quality matches.

• Advertisers mislead automated search engines.

• Scalability in search engines must meet WWW growth.


Content

• 3 Tier Framework

• Components of a search engine

• Crawler

• PageRank

• Indices

• Map-Reduce Parallelism Framework

• Finding Similar Pages

• Jaccard Measure of Similarity

• Minhashing

• Locality-Sensitive Hashing

• Google


3 Tier Framework

http://goo.gl/CmFF


The components of a search engine


Crawler

• A process that downloads web pages to a Page Repository.

• Examines each page for links to other pages and inserts those that are not yet in the Page Repository into the set of pages to be crawled.

http://goo.gl/gG3s


Crawler

Challenge: Terminating search
• Description: Dynamically generated pages could create an endless loop.
• Solution: Limit the number of pages to crawl with a “depth” limit per site.

Challenge: Managing the repository
• Description: 1. Duplicated URLs to be crawled. 2. Duplicated pages due to mirror sites, different routes, plagiarism, etc.
• Solution: 1. An efficient index for checking stored pages. 2. Minhash and locality-sensitive hashing signatures.

Challenge: Selecting the next page
• Description: How to prioritise the next page to be crawled?
• Solution: Give priority to “important” pages.

Challenge: Speeding up the crawl
• Description: 1. How many processes should run simultaneously? 2. How to synchronise them so that they do not crawl the same site? 3. How to avoid causing a DoS attack on a site?
• Solution: 1. Scale to several machines, with several processes on a single machine due to idle states. 2. Assign processes to entire hosts or sites. 3. Do not issue frequent requests to a single site.
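
The crawl loop, together with two of the solutions listed above (an index of already-seen URLs and a depth limit), can be sketched as follows. This is a minimal sketch, not a production crawler: the `fetch_links` callback, the `max_depth` parameter, and the in-memory `TOY_WEB` link graph are assumptions made for illustration and stand in for real page downloads. For simplicity the depth is counted from the seed rather than per site.

```python
from collections import deque

def crawl(seeds, fetch_links, max_depth=2):
    """Breadth-first crawl; returns the URLs stored in the repository, in visit order."""
    repository = []                                # pages downloaded so far
    seen = set(seeds)                              # index of URLs already queued or stored
    frontier = deque((url, 0) for url in seeds)    # set of pages to be crawled
    while frontier:
        url, depth = frontier.popleft()
        repository.append(url)
        if depth == max_depth:                     # depth limit stops endless link chains
            continue
        for link in fetch_links(url):
            if link not in seen:                   # avoid duplicating URLs to be crawled
                seen.add(link)
                frontier.append((link, depth + 1))
    return repository

# Hypothetical link graph: "c" and "d" link to each other, which could loop forever.
TOY_WEB = {"a": ["b", "c"], "b": ["a"], "c": ["d"], "d": ["c", "e"], "e": []}
pages = crawl(["a"], lambda url: TOY_WEB.get(url, []), max_depth=2)
print(pages)
```

Page "e" is never reached here because it lies beyond the depth limit; prioritising “important” pages would replace the plain queue with a priority queue.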


Query Processing in Search Engines

• Search engine queries are not like SQL queries

• Require inverted indices

• Disk access is too expensive to offer the user an acceptable response time

• Matched records are ranked before being shown to the user


PageRank

• Algorithm for identifying “important” pages

• A Web page is important if many important pages link to it

http://goo.gl/gKsQ

http://goo.gl/CsuN


Recursive Formulation of Page Rank

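[Figure: “The Web in 1839”, a graph of three pages: Yahoo!, Amazon, and Microsoft]

Transition Matrix (rows and columns in the order Yahoo!, Amazon, Microsoft)

$$
M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
$$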

The transition matrix of the Web, M, has element $m_{ij}$ in row i and column j, where

1. $m_{ij} = 1/r$ if page j has a link to page i, and there are a total of r ≥ 1 pages that j links to

2. $m_{ij} = 0$ otherwise


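Suppose y, a, and m represent the PageRanks of Yahoo!, Amazon, and Microsoft, that is, the fractions of the time the random walker spends at each page. They satisfy

$$
\begin{pmatrix} y \\ a \\ m \end{pmatrix} =
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} y \\ a \\ m \end{pmatrix}
$$

Starting from the uniform vector (1/3, 1/3, 1/3) and multiplying by M repeatedly:

$$
\begin{pmatrix} 2/6 \\ 3/6 \\ 1/6 \end{pmatrix} =
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix},
\qquad
\begin{pmatrix} 5/12 \\ 4/12 \\ 3/12 \end{pmatrix} =
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} 2/6 \\ 3/6 \\ 1/6 \end{pmatrix}
$$

After repeating the process several times (rows in the order Yahoo!, Amazon, Microsoft):

$$
\begin{pmatrix} 9/24 \\ 11/24 \\ 4/24 \end{pmatrix},\;
\begin{pmatrix} 20/48 \\ 17/48 \\ 11/48 \end{pmatrix},\; \dots,\;
\begin{pmatrix} 2/5 \\ 2/5 \\ 1/5 \end{pmatrix}
$$

The limit is fixed by the additional constraint y + a + m = 1, since y, a, and m form a probability distribution.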
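
The iteration above can be reproduced with exact fractions. This is a minimal sketch: the function name `step`, the starting vector, and the choice of 100 iterations are assumptions made for illustration; the matrix is the one from the slide.

```python
from fractions import Fraction as F

# Transition matrix from the slide; rows and columns ordered Yahoo!, Amazon, Microsoft.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]

def step(matrix, p):
    """One step of the random walk: returns the product matrix * p."""
    return [sum(m_ij * p_j for m_ij, p_j in zip(row, p)) for row in matrix]

p = [F(1, 3)] * 3            # the walker is equally likely to start anywhere
history = [p]
for _ in range(100):
    p = step(M, p)
    history.append(p)

print(history[1])                  # first iterate, equal to (2/6, 3/6, 1/6)
print([float(x) for x in p])       # approaches (2/5, 2/5, 1/5)
```

Because every column of M sums to 1, each iterate still sums to 1, which is why the limit is a probability distribution.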


Spider Traps and Dead Ends

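[Figure: the graph of Yahoo!, Amazon, and Microsoft in which Microsoft becomes a spider trap]

Resulting PageRank: Yahoo! = 0, Amazon = 0, Microsoft = 1

[Figure: the graph of Yahoo!, Amazon, and Microsoft in which Microsoft becomes a dead end]

Resulting PageRank: Yahoo! = 0, Amazon = 0, Microsoft = 0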


Spider traps and dead ends solution

• Limit the time that random walker is allowed to wander at random

• Pick a constant β<1, typically in the range 0.8 to 0.9.

• Taxation rate: 1-β

• If the walker gets stuck in a spider trap, it will disappear and be replaced by a new walker after a few time steps

• If the walker reaches a dead end and disappears, a new walker will take over shortly


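Microsoft becomes a spider trap

[Figure: graph of Yahoo!, Amazon, and Microsoft]

$$
P_{new} = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} P_{old}
+ 0.2 \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}
$$

After several iterations: Yahoo! = 7/33, Amazon = 5/33, Microsoft = 21/33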
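
The taxed iteration for the spider-trap example can be sketched as below. The function name `taxed_pagerank` and the fixed iteration count are assumptions for illustration; β = 0.8 and the matrix are taken from the slide.

```python
def taxed_pagerank(matrix, beta=0.8, iterations=100):
    """Iterate P_new = beta * M * P_old + (1 - beta) * (1/n, ..., 1/n)."""
    n = len(matrix)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [beta * sum(m * x for m, x in zip(row, p)) + (1 - beta) / n
             for row in matrix]
    return p

# Spider-trap matrix from the slide; rows and columns ordered Yahoo!, Amazon, Microsoft.
M_TRAP = [[0.5, 0.5, 0.0],
          [0.5, 0.0, 0.0],
          [0.0, 0.5, 1.0]]

ranks = taxed_pagerank(M_TRAP)
print(ranks)   # approaches (7/33, 5/33, 21/33)
```

Microsoft still receives the largest share, but the taxation keeps Yahoo! and Amazon from dropping to zero.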


Teleport Sets

• Selected set of nodes

• Eliminates spam and pages that are not relevant to the search topic

• Nodes are selected from trusted open directories, keywords in pages on a topic, users’ bookmarks, recently searched keywords, etc.


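[Figure: “The Web in 1839”, the graph of Yahoo!, Amazon, and Microsoft]

$$
\begin{pmatrix} y \\ a \\ m \end{pmatrix} =
0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} y \\ a \\ m \end{pmatrix}
+ 0.2 \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}
$$

After several iterations: Yahoo! = 10/31, Amazon = 15/31, Microsoft = 6/31

$$
P_{new} = \beta M P_{old} + (1 - \beta)\, t
$$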
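
As a check, the vector (10/31, 15/31, 6/31) can be substituted into the equation with β = 0.8 and t = (0, 1, 0), that is, a teleport set containing only Amazon. This verifies the values from the slide rather than deriving them:

```latex
\begin{aligned}
y &= 0.8\left(\tfrac{1}{2}\cdot\tfrac{10}{31} + \tfrac{1}{2}\cdot\tfrac{15}{31}\right) + 0.2\cdot 0
   = 0.8\cdot\tfrac{25}{62} = \tfrac{10}{31},\\
a &= 0.8\left(\tfrac{1}{2}\cdot\tfrac{10}{31} + 1\cdot\tfrac{6}{31}\right) + 0.2\cdot 1
   = \tfrac{8.8}{31} + \tfrac{6.2}{31} = \tfrac{15}{31},\\
m &= 0.8\left(\tfrac{1}{2}\cdot\tfrac{15}{31}\right) + 0.2\cdot 0
   = \tfrac{6}{31}.
\end{aligned}
```

The three values also sum to 1, as required for a probability distribution.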


Link Spam

• Spam farming in order to accumulate and concentrate PageRank on a few pages

• Links to the spam farm from publicly accessible blogs, with messages like “I agree with you. See x1234.mySpam.Farm.com”

[Figure: a spam farm around the node S, with links from outside]


Link Spam Solution

• Compute the TrustRank of pages

• TrustRank: Topic-specific PageRank computed with a Teleport set consisting of only “trusted” pages

• Manual trusted pages collection

• Use teleport sets of serious pages such as universities

• Compute the difference between the PageRank and TrustRank for each page. This difference is the negative TrustRank


Indices

Documents with ids 0, 1, 2

0: the cat is fat
1: was raining cats and dogs
2: Fido the dog

Inverted Index

and: 1
cat: 0, 1
dog: 1, 2
fat: 0
fido: 2
is: 0
raining: 1
the: 0, 2
was: 1
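
The index above can be rebuilt with a few lines of code. This is a minimal sketch: the `normalize` helper is a toy assumption (lowercasing and stripping a plural “s” from longer words) that is just enough to map “cats” to “cat” and “dogs” to “dog” as the slide does; a real engine would use a proper stemmer.

```python
from collections import defaultdict

def normalize(word):
    """Toy normalization (an assumption): lowercase, strip a plural 's' from longer words."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def build_inverted_index(documents):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.split():
            index[normalize(word)].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}

DOCS = ["the cat is fat", "was raining cats and dogs", "Fido the dog"]
INDEX = build_inverted_index(DOCS)
for term, ids in INDEX.items():
    print(term, ids)
```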


Inverted Indices

• Essential for Web Queries

• Uses indirect buckets for space efficiency

[Figure: the inverted-index entries for “cat” and “dog” point to buckets, and the buckets point to the documents “... the cat is fat ...”, “... was raining cats and dogs ...”, and “... Fido the dog ...”]


Storing more information in the inverted index

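[Figure: buckets for the index terms “Cat” and “Dog” whose entries have the fields Type, Position, and Document. The entries shown are (title, 5), (header, 10), (anchor, 3), (text, 57), (title, 100), and (title, 12), pointing to Doc 1, Doc 2, and Doc 3; the figure also shows the text “Dogs compared with cats”.]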


Map-Reduce Parallelism Framework

• Large-scale parallel machines share high load operations such as joins

• Distributed architectures

• Grid, networks and corporate DBs

• MRP paradigm expresses large-scale computations

[Figure: Execution of map and reduce functions. Input key-value pairs go to Map; the intermediate key-value pairs are sorted by keys; Reduce produces the output lists.]
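
The flow of map, sort by key, and reduce can be imitated in a single process. This is only a sketch of the data flow, with no parallelism or distribution: the function names and the word-count task are assumptions chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, map_fn, reduce_fn):
    """Run map, sort the intermediate pairs by key, then reduce each key group."""
    # Map: every input key-value pair yields a list of intermediate key-value pairs.
    intermediate = [pair for key, value in inputs for pair in map_fn(key, value)]
    # Sort intermediate pairs by key so that equal keys become adjacent.
    intermediate.sort(key=itemgetter(0))
    # Reduce: each key and the list of its values yields one output item.
    return [reduce_fn(key, [value for _, value in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

def count_words_map(doc_id, text):
    return [(word, 1) for word in text.split()]

def count_words_reduce(word, counts):
    return (word, sum(counts))

DOCS = [(0, "the cat is fat"), (2, "fido the dog")]
result = map_reduce(DOCS, count_words_map, count_words_reduce)
print(result)
```

In a real deployment the map calls and the reduce calls for different keys run on different machines; only the sort and grouping step requires moving data between them.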


Jaccard Measure of Similarity

• Finding Similar Items

• Jaccard similarity is the ratio of the sizes of the intersection and the union of the sets S and T.

|S⋂T|/|S⋃T|

{1,2,3} and {1,3,4,5} have ratio 2/5

• A k-gram, or k-shingle, is any substring of length k of a document; a document can be represented by the set of its k-shingles.

For k = 3, “A number of …” yields “A n”, “ nu”, “num”, and so on.
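
Both definitions translate directly into code. This is a minimal sketch: the function names are assumptions, and `jaccard` assumes that at least one of the two sets is non-empty.

```python
def jaccard(s, t):
    """Ratio of the size of the intersection to the size of the union."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)   # assumes the union is not empty

def shingles(text, k):
    """Set of all substrings of length k (k-shingles) of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

similarity = jaccard({1, 2, 3}, {1, 3, 4, 5})
three_shingles = shingles("A number of", 3)
print(similarity)
print(sorted(three_shingles))
```

Two documents can then be compared by computing `jaccard` on their shingle sets.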


Minhashing

• It is a technique to form a short signature for each set

• The Jaccard similarity of two sets can be estimated from their signatures

• The minhash value of a set S is the first element, in a randomly permuted order of the universal set, that is a member of S

• Universal set of elements is {1,2,3,4,5} and a permuted order is: (3,5,4,2,1). Then, the hash value for the set {2,3,5} is 3.
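
The example can be checked in code, along with the reason minhashing works: over all permutations of the universal set, the fraction on which two sets get the same minhash value equals their Jaccard similarity. This is a minimal sketch with an assumed function name; enumerating all permutations is feasible only for a tiny universal set like this one.

```python
from itertools import permutations

def minhash(s, permutation):
    """First element of the permuted universal set that is a member of s."""
    for element in permutation:
        if element in s:
            return element
    raise ValueError("set shares no element with the universal set")

value = minhash({2, 3, 5}, [3, 5, 4, 2, 1])
print(value)

# Fraction of all 120 permutations on which the two sets agree.
S, T = {1, 2, 3}, {1, 3, 4, 5}
orders = list(permutations([1, 2, 3, 4, 5]))
agreement = sum(minhash(S, p) == minhash(T, p) for p in orders) / len(orders)
print(agreement)
```

In practice only a sample of permutations is used, so the agreement rate is an estimate of the Jaccard similarity rather than its exact value.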


Locality-Sensitive Hashing (LSH)

• Minhashing is fast, but there are still too many pairs of sets to compare

• LSH hashes sets to buckets so that “similar” elements are assigned to the same bucket

• Trades off the number of buckets (constrained by memory) against the chance of missing a pair of similar elements


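[Figure: Dividing signatures into bands and hashing based on the values in a band. The n signatures are divided into b bands of r rows each, and each band is hashed to buckets.]

[Figure: The probability that a pair of signatures will appear together in at least one bucket. Horizontal axis: similarity s, from 0 to 1. Vertical axis: probability of at least one bucket in common, from 0 to 1. The threshold is approximately $s = (1/b)^{1/r}$.]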
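
The curve in the figure follows from the banding scheme: two signatures agree in all r rows of a given band with probability s^r, so they share a bucket in at least one of the b bands with probability 1 − (1 − s^r)^b. The sketch below evaluates that formula and the threshold; the function names and the values r = 5, b = 20 are assumptions chosen for illustration.

```python
def candidate_probability(s, r, b):
    """Probability that two sets with similarity s share a bucket in at least one band."""
    return 1 - (1 - s ** r) ** b

def threshold(r, b):
    """Approximate similarity at which the probability rises most steeply."""
    return (1 / b) ** (1 / r)

r, b = 5, 20   # illustrative values: 20 bands of 5 rows, i.e. 100 hash functions
print(threshold(r, b))
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, candidate_probability(s, r, b))
```

Increasing b lowers the threshold and misses fewer similar pairs, at the cost of more buckets and more candidate pairs to check.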


Combining Minhashing and LSH

1. Compute minhash signatures, using as many hash functions as the desired accuracy requires

2. Perform LSH to get candidate pairs of signatures that hash to the same bucket for at least one band

3. For each candidate pair, compute the estimate of their Jaccard similarity by counting the number of components in which their signatures agree

4. Optionally, for each pair whose signatures are sufficiently similar, compute their true Jaccard similarity by examining the sets themselves
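
Steps 1 to 3 can be sketched end to end as follows. This is a sketch under stated assumptions, not a tuned implementation: permutations are simulated with hash functions of the form (a·x + c) mod p, the prime `PRIME`, the seed, the example sets, and all function names are hypothetical, and set elements are assumed to be non-negative integers smaller than `PRIME`. Each band's tuple of values serves directly as the bucket key.

```python
import random
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647   # a large prime for the hash family (an assumption)

def make_hashes(n, seed=0):
    """n random hash functions h(x) = (a*x + c) mod PRIME, each simulating a permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]

def signature(s, hashes):
    # Step 1: one minhash value per hash function.
    return [min((a * x + c) % PRIME for x in s) for a, c in hashes]

def candidate_pairs(signatures, rows):
    # Step 2: sets whose signatures agree on every row of some band share a bucket.
    candidates = set()
    length = len(next(iter(signatures.values())))
    for start in range(0, length, rows):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            buckets[tuple(sig[start:start + rows])].append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates

def estimated_similarity(sig1, sig2):
    # Step 3: fraction of signature components on which the two signatures agree.
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

SETS = {"A": {1, 2, 3, 4, 5, 6}, "B": {1, 2, 3, 4, 5, 6}, "C": {100, 200, 300}}
HASHES = make_hashes(20)
SIGS = {name: signature(s, HASHES) for name, s in SETS.items()}
PAIRS = candidate_pairs(SIGS, rows=5)
print(PAIRS)
print(estimated_similarity(SIGS["A"], SIGS["B"]))
```

Step 4 would then compute the true Jaccard similarity on the original sets, but only for the pairs in `PAIRS`.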


Google Apps

Anatomy of a Google Search

• Uses links, PageRank, anchors, proximity and visual presentation (e.g. bold text is weighted higher) in its search logic.

1. Search the index

2. Analyze the web pages for relevance

3. Evaluate the site’s reputation

4. Rank the web pages


Google’s System Anatomy

http://goo.gl/yYbb

http://infolab.stanford.edu/~backrub/google.html


Google particularities

• PageRank

• Anchor text

• Location information and use of proximity in search

• Visual presentations such as font, capitalization and size of words are weighted differently


References

• The Anatomy of a Large-Scale Hypertextual Web Search Engine


• Database Systems. The Complete Book. Second Edition. Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom






Questions
