Search Engines and Google Francisco Velázquez 3. Nov. 2010 1


Page 1: Search Engines and Google - Universitetet i oslo · Search Engines • Search engine queries are not like SQL queries • Require inverted indices • Disk access is very expensive

Search Engines and Google

Francisco Velázquez, 3. Nov. 2010


Motivation

• Human-maintained lists are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics.

• Automated search engines that rely on keyword matching return low quality matches.

• Advertisers mislead automated search engines.

• Scalability in search engines must meet WWW growth.


Content

• 3 Tier Framework

• Components of a search engine

• Crawler

• PageRank

• Indices

• Map-Reduce Parallelism Framework

• Finding Similar Pages

• Jaccard Measure of Similarity

• Minhashing

• Locality-Sensitive Hashing

• Google


3 Tier Framework

http://goo.gl/CmFF


The components of a search engine


Crawler

• A process that downloads web pages to a Page Repository.

• Examines pages for links to other pages and inserts those that are not yet in the Page Repository into the set of pages to be crawled. http://goo.gl/gG3s


Crawler

Challenge: Terminating search

• Description: Dynamically generated pages could create an endless loop

• Solution: Limit the number of pages to crawl with a “depth” limit per site

Challenge: Managing the repository

• Description:
  1. Duplication of URLs to be crawled
  2. Duplicated pages due to mirror sites, different routes, plagiarism, etc.

• Solution:
  1. An efficient index for checking stored pages
  2. Minhash and locality-sensitive hashing signatures

Challenge: Selecting the next page

• Description: How to prioritise the next page to be crawled?

• Solution: Give priority to “important” pages

Challenge: Speeding up the crawl

• Description:
  1. How many processes should run simultaneously?
  2. How to synchronise them so that they do not crawl the same site
  3. Avoid mounting a DoS attack

• Solution:
  1. Scale to several machines
  2. Assign processes to entire hosts or sites
  3. Do not issue frequent requests to a single site
  Several processes can run on a single machine because of idle states.
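The depth limit and the index of already-seen URLs can be illustrated with a small breadth-first crawl. This is only a sketch under stated assumptions: `TOY_WEB`, `fetch_links`, `max_depth` and `max_pages` are hypothetical names, and an in-memory dictionary stands in for real HTTP requests.

```python
from collections import deque

def crawl(seeds, fetch_links, max_depth=2, max_pages=100):
    """Breadth-first crawl; returns the URLs stored in the repository, in crawl order."""
    repository = []                  # stands in for the Page Repository
    seen = set(seeds)                # index of URLs already queued or stored
    frontier = deque((url, 0) for url in seeds)
    while frontier and len(repository) < max_pages:
        url, depth = frontier.popleft()
        repository.append(url)       # "download" the page
        if depth == max_depth:       # depth limit guards against endless generated pages
            continue
        for link in fetch_links(url):
            if link not in seen:     # never queue the same URL twice
                seen.add(link)
                frontier.append((link, depth + 1))
    return repository

# Hypothetical in-memory web used instead of real network access.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["e"],
    "e": ["a"],
}

pages = crawl(["a"], lambda url: TOY_WEB.get(url, []), max_depth=2)
print(pages)
```

With a depth limit of 2, page "e" is never fetched because it is three links away from the seed. A real crawler would also add politeness delays per site and prioritise "important" pages, which this sketch leaves out.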


Query Processing in Search Engines

• Search engine queries are not like SQL queries

• Require inverted indices

• Disk accesses are very expensive, which makes it hard to offer the user an acceptable response time

• Matched records are ranked before being shown to the user


PageRank

• Algorithm for identifying “important” pages

• A Web page is important if many important pages link to it

http://goo.gl/gKsQ

http://goo.gl/CsuN


Recursive Formulation of Page Rank

Figure: “The Web in 1839”, a graph with the three pages Yahoo!, Amazon and Microsoft.

Transition Matrix (rows and columns in the order Yahoo!, Amazon, Microsoft)

$$
M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
$$

The matrix M, the transition matrix of the Web, has element $m_{ij}$ in row i and column j, where

1. $m_{ij} = 1/r$ if page j has a link to page i, and there are a total of r ≥ 1 pages that j links to

2. $m_{ij} = 0$ otherwise
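The two rules for the matrix entries translate directly into code. The sketch below is illustrative only: the `LINKS` dictionary is an assumption, read off the columns of the matrix on this slide rather than taken from the figure, and `transition_matrix` is a hypothetical helper name.

```python
from fractions import Fraction

def transition_matrix(links, pages):
    """Return M as a list of rows, with M[i][j] = 1/r if page j links to page i."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    matrix = [[Fraction(0)] * n for _ in range(n)]
    for source, targets in links.items():
        r = len(targets)             # number of pages that the source links to
        for target in targets:
            matrix[index[target]][index[source]] = Fraction(1, r)
    return matrix

PAGES = ["Yahoo!", "Amazon", "Microsoft"]
# Assumed link structure, derived from the columns of the slide's matrix.
LINKS = {
    "Yahoo!": ["Yahoo!", "Amazon"],
    "Amazon": ["Yahoo!", "Microsoft"],
    "Microsoft": ["Amazon"],
}

M = transition_matrix(LINKS, PAGES)
for row in M:
    print([str(x) for x in row])
```

Each column sums to 1 as long as every page has at least one out-link, which is what makes M describe a random walk.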


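Suppose y, a, and m represent the PageRanks of Yahoo!, Amazon and Microsoft, that is, the fractions of the time the random walker spends at each page. They satisfy

$$
\begin{pmatrix} y \\ a \\ m \end{pmatrix} =
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} y \\ a \\ m \end{pmatrix}
$$

Starting from equal values of 1/3:

$$
\begin{pmatrix} 2/6 \\ 3/6 \\ 1/6 \end{pmatrix} =
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}
$$

$$
\begin{pmatrix} 5/12 \\ 4/12 \\ 3/12 \end{pmatrix} =
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} 2/6 \\ 3/6 \\ 1/6 \end{pmatrix}
$$

After repeating the process several times (components in the order Yahoo!, Amazon, Microsoft):

$$
\begin{pmatrix} 9/24 \\ 11/24 \\ 4/24 \end{pmatrix},\quad
\begin{pmatrix} 20/48 \\ 17/48 \\ 11/48 \end{pmatrix},\quad \ldots,\quad
\begin{pmatrix} 2/5 \\ 2/5 \\ 1/5 \end{pmatrix}
$$

The values are scaled so that y + a + m = 1, since together they form a probability distribution.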
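The repeated multiplication can be reproduced with exact fractions. This is a minimal sketch, with `step` and `iterate` as hypothetical helper names and the number of iterations (50) chosen arbitrarily.

```python
from fractions import Fraction

HALF = Fraction(1, 2)
# Transition matrix from the slides; rows and columns ordered Yahoo!, Amazon, Microsoft.
M = [
    [HALF, HALF, 0],
    [HALF, 0, 1],
    [0, HALF, 0],
]

def step(matrix, vector):
    """One iteration: multiply the matrix by the current PageRank vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def iterate(matrix, vector, steps):
    for _ in range(steps):
        vector = step(matrix, vector)
    return vector

start = [Fraction(1, 3)] * 3
first = step(M, start)               # equals (2/6, 3/6, 1/6)
second = step(M, first)              # equals (5/12, 4/12, 3/12)
approx = [float(x) for x in iterate(M, start, 50)]
print(first, second, approx)
```

Because every column of M sums to 1, each step preserves y + a + m = 1, and the iterates approach (2/5, 2/5, 1/5).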


Spider Traps and Dead Ends

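Figure: two variants of the graph of Yahoo!, Amazon and Microsoft, each with the PageRank vector it converges to (components in the order Yahoo!, Amazon, Microsoft).

• Microsoft becomes a spider trap: the vector converges to (0, 0, 1)

• Microsoft becomes a dead end: the vector converges to (0, 0, 0)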


Spider traps and dead ends solution

• Limit the time that the random walker is allowed to wander at random

• Pick a constant β<1, typically in the range 0.8 to 0.9.

• Taxation rate: 1-β

• If the walker gets stuck in a spider trap, it will disappear and be replaced by a new walker after a few time steps

• If the walker reaches a dead end and disappears, a new walker will take over shortly


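Microsoft becomes a spider trap (figure: graph of Yahoo!, Amazon and Microsoft)

$$
P_{new} = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} P_{old}
+ 0.2 \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}
$$

After several iterations (components in the order Yahoo!, Amazon, Microsoft):

$$
\begin{pmatrix} 7/33 \\ 5/33 \\ 21/33 \end{pmatrix}
$$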
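The taxed iteration for the spider-trap example can be sketched as below. The function name `taxed_pagerank` and the iteration count are assumptions for illustration; β = 0.8 and the matrix come from the slide.

```python
def taxed_pagerank(matrix, beta=0.8, steps=100):
    """Iterate P_new = beta * M * P_old + (1 - beta) * (1/n, ..., 1/n)."""
    n = len(matrix)
    rank = [1.0 / n] * n
    for _ in range(steps):
        rank = [
            beta * sum(m * p for m, p in zip(row, rank)) + (1 - beta) / n
            for row in matrix
        ]
    return rank

# Spider-trap matrix from the slide: Microsoft links only to itself.
TRAP = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]

y, a, m = taxed_pagerank(TRAP)
print(y, a, m)
```

Microsoft still collects most of the rank, but Yahoo! and Amazon no longer drop to zero, in line with the values 7/33, 5/33 and 21/33 on the slide.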


Teleport Sets

• Selected set of nodes

• Eliminate spam and pages that do not concern the search topic

• Nodes are selected from trusted open directories, keywords in pages on a topic, users’ bookmarks, recently searched keywords, etc.


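Figure: “The Web in 1839” (graph of Yahoo!, Amazon and Microsoft)

$$
P_{new} = \beta M P_{old} + (1-\beta)\,t
$$

With β = 0.8 and teleport vector t = (0, 1, 0):

$$
\begin{pmatrix} y \\ a \\ m \end{pmatrix} =
0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} y \\ a \\ m \end{pmatrix}
+ 0.2 \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}
$$

After several iterations (components in the order Yahoo!, Amazon, Microsoft):

$$
\begin{pmatrix} 10/31 \\ 15/31 \\ 6/31 \end{pmatrix}
$$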
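As a check on the quoted limit, the fixed point can be worked out by hand. The derivation below assumes the components are ordered Yahoo!, Amazon, Microsoft, so that the teleport vector (0, 1, 0) selects Amazon as the only teleport node, and uses β = 0.8 with the transition matrix of “The Web in 1839”.

$$
\begin{aligned}
y &= 0.4\,y + 0.4\,a \\
a &= 0.4\,y + 0.8\,m + 0.2 \\
m &= 0.4\,a
\end{aligned}
$$

$$
\begin{aligned}
y &= \tfrac{2}{3}\,a, \qquad m = \tfrac{2}{5}\,a \\
a &= \tfrac{4}{15}\,a + \tfrac{8}{25}\,a + \tfrac{1}{5}
\;\Rightarrow\; \tfrac{31}{75}\,a = \tfrac{1}{5}
\;\Rightarrow\; a = \tfrac{15}{31} \\
y &= \tfrac{10}{31}, \qquad m = \tfrac{6}{31}, \qquad y + a + m = 1
\end{aligned}
$$

Amazon ends up with the largest share because every teleport lands on it.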


Link Spam

• Spam farming in order to accumulate and concentrate PageRank on a few pages

• Links to the spam farm from publicly accessible blogs, with messages like “I agree with you. See x1234.mySpam.Farm.com”

Figure (spam farm): node S, with links from outside.


Link Spam Solution

• Compute the TrustRank of pages

• TrustRank: Topic-specific PageRank computed with a Teleport set consisting of only “trusted” pages

• Manual trusted pages collection

• Use teleport sets made up of serious pages, such as universities

• Compute the difference between the PageRank and TrustRank for each page. This difference is the negative TrustRank


Indices

Documents with ids 0, 1, 2

0: the cat is fat

1: was raining cats and dogs

2: Fido the dog

Inverted Index

and: 1

cat: 0, 1

dog: 1, 2

fat: 0

fido: 2

is: 0

raining: 1

the: 0, 2

was: 1
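The index above can be rebuilt from the three documents with a few lines of code. This is a sketch, not the author's method: the `normalise` function is a crude stand-in for stemming (an assumption needed to map “cats” to “cat” and “dogs” to “dog”), and the function names are hypothetical.

```python
from collections import defaultdict

DOCS = {
    0: "the cat is fat",
    1: "was raining cats and dogs",
    2: "Fido the dog",
}

def normalise(word):
    """Lower-case and strip a plural 's' from words longer than three letters."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word

def build_inverted_index(docs):
    """Map each normalised term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[normalise(word)].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}

index = build_inverted_index(DOCS)
for term, ids in index.items():
    print(term, ids)
```

A query for a word is then a dictionary lookup, and a query for several words is an intersection of the posting lists.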


Inverted Indices

• Essential for Web Queries

• Uses indirect buckets for space efficiency

Figure: the inverted index entries for “cat” and “dog” point to buckets, which in turn point to the documents “... the cat is fat ...”, “... was raining cats and dogs ...” and “... Fido the dog ...”.


Storing more information in the inverted index

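Figure: the index entries for “Cat” and “Dog” point to bucket records with the columns Type, Position and Document; the records point to Doc 1, Doc 2 and Doc 3. Document text shown: “Dogs compared with cats”.

Bucket records shown (type, position): title 5; header 10; anchor 3; text 57; title 100; title 12.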


Map-Reduce Parallelism Framework

• Large-scale parallel machines share high load operations such as joins

• Distributed architectures

• Grid, networks and corporate DBs

• The Map-Reduce Parallelism (MRP) paradigm expresses large-scale computations

Figure: Execution of map and reduce functions. Input key-value pairs go to Map; the intermediate key-value pairs are sorted by key; Reduce turns them into output lists.
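The map, sort and reduce stages can be imitated on a single machine to show the data flow. This is a toy sketch, not a description of any real map-reduce system: `map_reduce`, `map_words` and `reduce_postings` are hypothetical names, and the example builds posting lists for two of the documents used earlier.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, map_fn, reduce_fn):
    """Run map over the input pairs, sort intermediate pairs by key, then reduce each group."""
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    intermediate.sort(key=itemgetter(0))          # sort intermediate pairs by key
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.append(reduce_fn(key, [value for _, value in group]))
    return output

def map_words(doc_id, text):
    """Emit one (word, doc_id) pair per word occurrence."""
    return [(word.lower(), doc_id) for word in text.split()]

def reduce_postings(word, doc_ids):
    """Collapse all document ids for a word into one sorted posting list."""
    return (word, sorted(set(doc_ids)))

DOCS = [(0, "the cat is fat"), (2, "Fido the dog")]
result = map_reduce(DOCS, map_words, reduce_postings)
print(result)
```

In a distributed setting the map calls and the reduce calls would run on different machines, with the sort step acting as the exchange between them.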


Jaccard Measure of Similarity

• Finding Similar Items

• Jaccard similarity is the ratio of the sizes of the intersection and the union of the sets S and T.

|S⋂T|/|S⋃T|

{1,2,3} and {1,3,4,5} have ratio 2/5

• A k-gram or k-shingle is a substring of length k of a document; a document is represented by its set of k-shingles.

For k = 3, “A number of …” yields “A n”, “ nu”, “num”, and so on.
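Both definitions fit in a few lines. The sketch below uses hypothetical function names, and the value returned for two empty sets is a convention assumed here, not something the slides specify.

```python
def jaccard(s, t):
    """Ratio of the size of the intersection to the size of the union."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0                   # assumed convention for two empty sets
    return len(s & t) / len(s | t)

def shingles(text, k):
    """Set of all substrings of length k of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

similarity = jaccard({1, 2, 3}, {1, 3, 4, 5})
grams = shingles("A number of", 3)
print(similarity, sorted(grams))
```

Two documents can then be compared by taking the Jaccard similarity of their shingle sets.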


Minhashing

• It is a technique to form a short signature for each set

• Estimates the Jaccard similarity using signatures

• A minhash value of a set S is the first element, in a random permutation of the universal set, that is a member of S

• Example: if the universal set of elements is {1,2,3,4,5} and the permuted order is (3,5,4,2,1), then the minhash value for the set {2,3,5} is 3.
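A short sketch of the definition, using the permutation from the example. The second part rests on a known property of minhashing: the fraction of permutations on which two sets receive the same minhash value equals their Jaccard similarity. Here it is checked by brute force over all 120 permutations, which is only feasible because the universal set is tiny; the sets `S` and `T` are the ones from the Jaccard slide.

```python
from itertools import permutations

def minhash(permutation, s):
    """First element of the permuted universal set that is a member of s."""
    for element in permutation:
        if element in s:
            return element
    raise ValueError("set shares no element with the universal set")

value = minhash((3, 5, 4, 2, 1), {2, 3, 5})

# Count the permutations of {1,...,5} on which S and T get the same minhash value.
S, T = {1, 2, 3}, {1, 3, 4, 5}
all_orders = list(permutations(range(1, 6)))
agree = sum(minhash(p, S) == minhash(p, T) for p in all_orders)
print(value, agree, len(all_orders))
```

In practice only a modest number of random permutations (or hash functions simulating them) is used, and the fraction of agreeing signature components serves as the estimate.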


Locality-Sensitive Hashing (LSH)

• Minhashing is fast, but there are still too many pairs of sets to compare

• LSH hashes sets to buckets so that “similar” elements are assigned to the same bucket

• Trades off the number of buckets (constrained by memory) against the chance of missing a pair of similar elements


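Figure: Dividing signatures into bands and hashing based on the values in a band (n signatures, b bands of r rows each, hashed to buckets).

Figure: The probability that a pair of signatures will appear together in at least one bucket. Horizontal axis: similarity s, from 0 to 1. Vertical axis: probability of at least one bucket in common, from 0 to 1. The threshold marked on the curve is

$$
s = (1/b)^{1/r}
$$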
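The shape of the curve follows from a standard argument, writing b for the number of bands and r for the rows per band: two signatures with similarity s agree on all r rows of a given band with probability s^r, so they share at least one bucket with probability 1 - (1 - s^r)^b. The sketch below evaluates this; the values r = 5 and b = 20 are arbitrary assumptions for illustration.

```python
def candidate_probability(s, r, b):
    """Probability that two signatures with similarity s agree in at least one of b bands of r rows."""
    return 1 - (1 - s ** r) ** b

def threshold(r, b):
    """Approximate similarity at which the curve rises most steeply: (1/b)^(1/r)."""
    return (1 / b) ** (1 / r)

R, B = 5, 20                          # assumed example values, 100 signature rows in total
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, R, B), 4))
print("threshold:", round(threshold(R, B), 3))
```

More bands lower the threshold and catch more similar pairs at the cost of more candidate pairs; more rows per band have the opposite effect.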


Combining Minhashing and LSH

1. Compute minhash signatures, using as many hash functions as the desired accuracy requires

2. Perform LSH to get candidate pairs of signatures that hash to the same bucket for at least one band

3. For each candidate pair, compute the estimate of their Jaccard similarity by counting the number of components in which their signatures agree

4. Optionally, for each pair whose signatures are sufficiently similar, compute their true Jaccard similarity by examining the sets themselves
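The four steps can be strung together in a small pipeline. Everything specific here is an assumption made for illustration: random linear hash functions modulo a Mersenne prime stand in for random permutations, `zlib.crc32` turns shingles into integers, and the band and row counts (20 and 5) and the toy documents are arbitrary.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1                 # Mersenne prime used as modulus for the hash family

def shingles(text, k=3):
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def signature(items, hash_params):
    """Step 1: one minimum per hash function gives the minhash signature."""
    ids = [zlib.crc32(item.encode("utf-8")) for item in items]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_params]

def candidate_pairs(signatures, bands, rows):
    """Step 2: pairs whose signatures fall into the same bucket for at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates

def estimated_similarity(sig_a, sig_b):
    """Step 3: fraction of signature components on which the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def jaccard(s, t):
    """Step 4: true Jaccard similarity computed from the sets themselves."""
    return len(s & t) / len(s | t)

rng = random.Random(0)                # fixed seed so that runs are repeatable
BANDS, ROWS = 20, 5
params = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(BANDS * ROWS)]

docs = {"d1": "the cat is fat", "d2": "the cat is fat", "d3": "was raining cats and dogs"}
sets = {name: shingles(text) for name, text in docs.items()}
sigs = {name: signature(s, params) for name, s in sets.items()}
pairs = candidate_pairs(sigs, BANDS, ROWS)
for a, b in sorted(pairs):
    print(a, b, estimated_similarity(sigs[a], sigs[b]), jaccard(sets[a], sets[b]))
```

The identical documents d1 and d2 always become a candidate pair because their signatures match in every band; whether a dissimilar document such as d3 shows up depends on the hash functions drawn.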


Google Apps


Anatomy of a Google Search

• Uses links, PageRank, anchors, proximity and visual presentation (e.g. bold text is weighted higher) in its search logic

1. Search the index

2. Analyze the web pages for relevance

3. Evaluate the site’s reputation

4. Rank the web pages


Google particularities

• PageRank

• Anchor text

• Location information and use of proximity in search

• Visual presentations such as font, capitalization and size of words are weighted differently


References

• The Anatomy of a Large-Scale Hypertextual Web Search Engine

http://infolab.stanford.edu/~backrub/google.html

• Database Systems. The Complete Book. Second Edition. Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom
