
Page 1

Searching the Web

Representation and Management of Data on the Internet

Page 2

What does a Search Engine do?

• Processes users’ queries
• Finds pages with related information
• Returns a list of resources
• Why can’t we use an ordinary database system that is reachable via an ordinary Web server?
• What are the difficulties in creating a search engine?

Page 3

Motivation

• The web is

– Used by millions

– Contains lots of information

– Link based

– Incoherent

– Changes rapidly

– Distributed

• Traditional information retrieval was built with the exact opposite in mind

Page 4

The Web’s Characteristics

• Size

– Over a billion pages available (Google is a spelling of googol = 10^100)
– 5-10 KB per page => tens of terabytes
– Size doubles every 2 years
• Change
– 23% of pages change daily
– About half of the pages no longer exist after 10 days
– Bowtie structure

Page 5

Bowtie Structure

[diagram] The web’s link graph looks like a bowtie:
– Core: strongly connected component (28%)
– Pages that reach the core (22%)
– Pages reachable from the core (22%)

Page 6

Search Engine Components

• User Interface

• Query processor

• Crawler

• Indexer

• Ranker

Page 7

An HTML form for inserting a search query

Usually a query is a list of words

What was the most popular query in Google in the last year?

What does it mean to be popular in Google?

Page 8

Page 9

Page 10

Crawling the Web

Page 11

Basic Crawler (Spider)

[diagram] A queue of pages feeds the crawl loop: removeBestPage( ), then findLinksInPage( ), then insertIntoQueue( ).

A crawler finds Web pages to download into a search engine cache.
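
A minimal sketch of this loop in Python, assuming a priority queue ordered by an importance score; fetch_page, extract_links, and score are hypothetical stand-ins for real HTTP, HTML-parsing, and metric code:

```python
import heapq

def crawl(seed_urls, fetch_page, extract_links, score, max_pages=1000):
    """Repeatedly remove the best page, download it, and queue its links."""
    # heapq pops the smallest item, so store negated scores to pop the
    # most important page first.
    queue = [(-score(url), url) for url in seed_urls]
    heapq.heapify(queue)
    cache, seen = {}, set(seed_urls)
    while queue and len(cache) < max_pages:
        _, url = heapq.heappop(queue)            # removeBestPage()
        page = fetch_page(url)                   # download into the cache
        cache[url] = page
        for link in extract_links(page):         # findLinksInPage()
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (-score(link), link))  # insertIntoQueue()
    return cache
```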

Page 12

Choosing Pages to Download

• Q: Which pages should be downloaded?
• A: It is usually not possible to download all pages, because of space limitations; try to get the most important pages
• Q: When is a page important?
• A: Use a metric: by interest, by popularity, by location, or a combination

Page 13

Interest Driven

• Suppose that there is a query Q containing the words we are interested in
• Define the importance of a page P by its textual similarity to the query Q
• Example: use a formula that combines:
– The number of appearances of words from Q in P
– For each word of Q, how frequently it is used overall (why is this important?)
• Problem: We must decide whether a page is important while crawling. However, we don’t know how rare a word is until the crawl is complete
• Solution: Use an estimate (sketched below)
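
One possible estimate, assuming we keep running word statistics over the pages crawled so far and use them to approximate word rarity; all names here are illustrative:

```python
import math
from collections import Counter

crawled_word_counts = Counter()   # pages seen so far that contain each word
pages_crawled = 0

def record_page(words):
    """Update the running statistics after downloading a page."""
    global pages_crawled
    crawled_word_counts.update(set(words))
    pages_crawled += 1

def importance(page_words, query_words):
    """Occurrences of each query word, weighted by its estimated rarity."""
    counts = Counter(page_words)
    score = 0.0
    for w in query_words:
        docs_with_w = crawled_word_counts[w] + 1          # +1 avoids log(0)
        rarity = math.log((pages_crawled + 1) / docs_with_w)
        score += counts[w] * rarity
    return score
```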

Page 14

Popularity Driven

• The importance of a page P is proportional to the number of pages with a link to P
• This is also called the number of backlinks of P
• As before, we need to estimate this amount
• There is a more sophisticated metric, called PageRank (taught on Tuesday)

Page 15

Location Driven

• The importance of P is a function of its URL
• Examples:
– Words appearing in the URL (e.g., edu or ac)
– Number of “/” characters in the URL
• Easily evaluated; requires no data from previous crawls
• Note: We can also use a combination of all three metrics (a location score is sketched below)
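
A minimal sketch of a location-driven score, assuming (purely for illustration) that academic domains score higher and that URLs closer to the site root score higher; the weights are invented:

```python
from urllib.parse import urlparse

def location_score(url):
    """Score a URL by its text alone; no data from previous crawls needed."""
    parsed = urlparse(url)
    score = 0.0
    # Words appearing in the URL, e.g. academic domain labels.
    if any(label in ("edu", "ac") for label in parsed.netloc.split(".")):
        score += 1.0
    # Fewer "/" separators => closer to the site root.
    depth = parsed.path.count("/")
    score += 1.0 / (1 + depth)
    return score

print(location_score("http://www.cs.example.edu/index.html"))  # edu, shallow
print(location_score("http://example.com/a/b/c/d/page.html"))  # deep path
```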

Page 16

Refreshing Web Pages

• Pages that have been downloaded must be refreshed periodically
• Q: Which pages should be refreshed?
• Q: How often should we refresh a page?

Page 17

Freshness Metric

• A cached page is fresh if it is identical to the version on the Web
• Suppose that S is a set of pages (i.e., a cache):

Freshness(S) = (number of fresh pages in S) / (number of pages in S)

Page 18

Age Metric

• The age of a page is the number of days since it was last refreshed
• Suppose that S is a set of pages (i.e., a cache):

Age(S) = average age of the pages in S
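
A minimal sketch computing both metrics over a toy cache; the CachedPage fields and the data are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class CachedPage:
    cached_version: str   # the copy we hold
    live_version: str     # the version currently on the Web
    age_days: float       # days since the page was last refreshed

def freshness(cache):
    """Fraction of cached pages identical to their live versions."""
    fresh = sum(1 for p in cache if p.cached_version == p.live_version)
    return fresh / len(cache)

def age(cache):
    """Average age of the pages in the cache."""
    return sum(p.age_days for p in cache) / len(cache)

cache = [CachedPage("v1", "v1", 2.0), CachedPage("v1", "v2", 9.0)]
print(freshness(cache))  # 0.5
print(age(cache))        # 5.5
```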

Page 19

Refresh Goal

• Crawlers can refresh only a certain number of pages in a given period of time
• The page-download resource can be allocated in many ways
• Goal: Minimize the age of the cache and maximize its freshness
• We need a refresh strategy

Page 20

Refresh Strategies

• Uniform refresh: The crawler revisits all pages with the same frequency, regardless of how often they change
• Proportional refresh: The crawler revisits a page with frequency proportional to the page’s change rate (i.e., if it changes more often, we visit it more often)

Which do you think is better?

Page 21

Trick Question

• Two-page database

• e1 changes daily

• e2 changes once a week

• Can visit one page per week

• How should we visit pages?

– e1 e2 e1 e2 e1 e2 e1 e2... [uniform]

– e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 … [proportional]

– e1 e1 e1 e1 e1 e1 ...

– e2 e2 e2 e2 e2 e2 ...

– ?

[diagram: pages e1 and e2 on the web, and their copies in the database]

Page 22

Proportional Often Not Good!

• Visiting fast-changing e1 gains 1/2 day of freshness
• Visiting slow-changing e2 gains 1/2 week of freshness
• Visiting e2 is a better deal!

Page 23

Another Example

• The collection contains 2 pages: e1 changes 9 times a day, e2 changes once a day
• Simplified change model:
– The day is split into 9 equal intervals: e1 changes once in each interval, and e2 changes once during the day
– We don’t know when the pages change within the intervals
• The crawler can download one page a day
• Our goal is to maximize freshness

Page 24

Which Page Do We Refresh?

• Suppose we refresh e2 at midday
• If e2 changes in the first half of the day, it remains fresh for the rest (half) of the day
– 50% chance of a 0.5-day freshness increase
– 50% chance of no increase
– Expected freshness increase: 0.25 day

Page 25

Which Page Do We Refresh?

• Suppose we refresh e1 at midday
• If e1 changes in the first half of its interval, and we refresh at midday (the middle of an interval), it remains fresh for the remaining half of the interval = 1/18 of a day
– 50% chance of a 1/18-day freshness increase
– 50% chance of no increase
– Expected freshness increase: 1/36 day
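
A short check of the two expectations, mirroring the slides’ numbers:

```python
# e2 changes once a day: with probability 1/2 the change happened before
# midday, and the refreshed copy then stays fresh for the remaining 1/2 day.
gain_e2 = 0.5 * 0.5          # expected gain: 0.25 day

# e1 changes once per 1/9-day interval: a midday refresh lands mid-interval,
# so it buys at most the remaining half interval = 1/18 day.
gain_e1 = 0.5 * (1 / 18)     # expected gain: 1/36 day

print(gain_e2, gain_e1)      # 0.25 vs ~0.028: refreshing e2 is the better deal
```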

Page 26

Not Every Page is Equal!

• Suppose that e1 is accessed twice as often as e2
• Then it is twice as important to us that e1 is fresh as it is that e2 is fresh

Page 27

Politeness Issues

• When a crawler crawls a site, it uses the site’s resources:
– The web server needs to find the file in the file system
– The web server needs to send the file over the network
• If a crawler asks for many of the pages at a high speed, it may
– crash the site’s web server, or
– be banned from the site
• Solution: Ask for pages “slowly”

Page 28

Politeness Issues (cont)

• A site may identify pages that it doesn’t want to be crawled (how?)
• A polite crawler will not crawl these pages (although nothing stops a crawler from being impolite)
• Put a file called robots.txt in the site’s main directory to identify pages that should not be crawled (e.g., http://www.cnn.com/robots.txt)

Page 29

robots.txt

• Use the User-Agent header to identify programs whose access should be restricted
• Use the Disallow header to identify pages that should be restricted (an example appears below)
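
A minimal sketch using Python’s standard urllib.robotparser to interpret a hypothetical robots.txt; the file contents and URLs are made up:

```python
from urllib import robotparser

# A hypothetical robots.txt: ban one crawler entirely, and keep
# every other crawler out of /private/.
robots_txt = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("BadBot", "http://example.com/index.html"))     # False
print(rp.can_fetch("MyCrawler", "http://example.com/private/a"))   # False
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))  # True
```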

Page 30

Other Issues

• Suppose that a search engine uses several crawlers at the same time (in parallel)
• How can we make sure that they are not doing the same work (i.e., visiting the same pages)?

Page 31

Index Repository

Page 32

Storage Challenges

• Scalability: Should be able to store huge amounts of data (data spans disks or computers)
• Dual access mode: Random access (find specific pages) and streaming access (find large subsets of pages)
• Large batch updates: Reclaim old space, avoid access/update conflicts
• Obsolete pages: Remove pages no longer on the web (how do we find these pages?)

Page 33

Storage Challenges

• Storage cost: Should be able to store the huge amounts of data at a reasonable cost (a disk that can store a few terabytes is very expensive, so what do search engines such as Google do?)

Page 34

Update Strategies

• Updates are generated by the crawler
• Several characteristics:
– The time at which the crawl occurs and the repository receives information
– Whether the crawl’s information replaces the entire database or modifies parts of it

Page 35

Batch Crawler vs. Steady Crawler

• Batch mode
– Periodically executed
– Allocated a certain amount of time
• Steady mode
– Runs all the time
– Always sends results back to the repository

Page 36

Partial vs. Complete Crawls

• A batch-mode crawler can do either
– A complete crawl every run, replacing the entire cache
– A partial crawl, replacing only a subset of the cache
• The repository can implement
– In-place update: Replaces the data in the cache, thus refreshing pages quickly
– Shadowing: Creates a new index with the updates and later replaces the previous one, thus avoiding refresh-access conflicts

Page 37

Partial vs. Complete Crawls

• Shadowing resolves the conflicts between updates and reads for queries
• Batch mode fits well with shadowing
• A steady crawler fits well with in-place updates

Page 38

Types of Indices

• Content index: Allows us to easily find pages containing certain words
• Links index: Allows us to easily find links between pages
• Utility index: Allows us to easily find pages in a certain domain, of a certain type, etc.
• Q: What do we need these for?

Page 39

Is the Following Content Index Good?

• Consider a table with one row per word occurrence:

Word | Frequency | UrlId
...  | ...       | ...

• We want to quickly find pages with a specific word
• Is this a good way of storing a content index?

Page 40

Is the Following Content Index Good? NO

• If a word appears in a thousand documents, then the word will appear in a thousand rows. Why waste the space?
• If a word appears in a thousand documents, we will have to access a thousand rows in order to find the documents
• It does not easily support queries that require multiple words

Page 41

Inverted Keyword Index

bush: (1, 5, 11, 17)
saddam: (3, 5, 11, 17)
war: (3, 5, 17, 28)
butterfly: (4, 22)

A hash table: the words are the keys, and the values are lists of matching documents, sorted by urlId.

Page 42

Query: “bush saddam war”

bush: (1, 5, 11, 17)
saddam: (3, 5, 11, 17)
war: (3, 5, 17, 28)

Answers: 5, 17

Algorithm: always advance the pointer(s) with the lowest urlId (sketched below)
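
A minimal sketch of this merge: one pointer per posting list; when all pointers agree on a urlId, that urlId matches every query word:

```python
def intersect(postings):
    """Intersect sorted posting lists by always advancing the lowest pointers."""
    pointers = [0] * len(postings)
    answers = []
    while all(p < len(lst) for p, lst in zip(pointers, postings)):
        current = [lst[p] for p, lst in zip(pointers, postings)]
        if len(set(current)) == 1:        # all lists agree: a match
            answers.append(current[0])
            pointers = [p + 1 for p in pointers]
        else:                             # advance the pointer(s) at the minimum
            lowest = min(current)
            pointers = [p + (c == lowest) for p, c in zip(pointers, current)]
    return answers

# Query "bush saddam war" over the lists above:
print(intersect([[1, 5, 11, 17], [3, 5, 11, 17], [3, 5, 17, 28]]))  # [5, 17]
```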

Page 43

Challenges

• Index building must be:
– Fast
– Economical
• Incremental indexing must be supported
• Tradeoff when using compression: memory is saved, but time is lost compressing and uncompressing

Page 44

How do we Distribute the Indices Between Files?

• Local inverted file
– Each file holds the full index for a disjoint subset of the pages
– The query is broadcast to all files
– The result is the merge of the answers from each file
• Global inverted file
– Each file is responsible for a subset of the terms in the collection
– The query is “sent” only to the appropriate files
• What happens if a disk crashes (which scheme is better in this case)?

Page 45

Ranking

Page 46

A Naïve Approach

• Let Q (the query) be a set of words
• Let countQ(P) be the number of occurrences of words of Q in P
• A naïve approach:
– If countQ(P1) > countQ(P2), then P1 should be ranked higher than P2
• What are the problems with the naïve approach?

Page 47

Testing the Naïve Approach

• Q = “green men mars”
– P1 = “I live in a green house with a green roof”
– P2 = “There is no life form on Mars”
– P3 = “Men don’t like green cars”
– P4 = “I saw some little green men yesterday”
• In what order do you think these ‘pages’ should be returned?

Page 48

The Vector Space Model

• The Vector Space Model (VSM) is a way of representing documents through the words that they contain
• It is a standard technique in Information Retrieval
• The VSM allows decisions to be made about which documents are similar to each other and to keyword queries

Page 49

How Does it Work?

• Each document is broken down into a word frequency table
• The tables are called vectors and can be stored as arrays
• A vocabulary is built from all the words in all the documents in the system
• Each document is represented as a vector against the vocabulary

Page 50

Example

• Document A: “A dog and a cat.”

a | dog | and | cat
2 | 1   | 1   | 1

• Document B: “A frog.”

a | frog
1 | 1

Page 51

Example (continued)

• The vocabulary contains all the words that are used:
– a, dog, and, cat, frog
• The vocabulary is sorted:
– a, and, cat, dog, frog

Page 52

Example (continued)

• Document A: “A dog and a cat.”

– Vector: (2,1,1,1,0)

• Document B: “A frog.”

– Vector: (1,0,0,0,1)

a | and | cat | dog | frog
2 | 1   | 1   | 1   | 0

a | and | cat | dog | frog
1 | 0   | 0   | 0   | 1

Page 53

Queries

• Queries can be represented as vectors in the same way as documents:
– “dog” = (0,0,0,1,0)
– “frog” = (0,0,0,0,1)
– “dog and frog” = (0,1,0,1,1)
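
A minimal sketch of the whole pipeline on this toy example, assuming whitespace tokenization and lowercasing; it reproduces the vectors above:

```python
from collections import Counter

def tokenize(text):
    """Lowercase, drop periods, split on whitespace."""
    return text.lower().replace(".", "").split()

docs = {"A": "A dog and a cat.", "B": "A frog."}

# The vocabulary: all words in all documents, sorted.
vocabulary = sorted({w for text in docs.values() for w in tokenize(text)})

def vectorize(text):
    """Represent a document (or a query) against the vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocabulary]

print(vocabulary)                 # ['a', 'and', 'cat', 'dog', 'frog']
print(vectorize(docs["A"]))       # [2, 1, 1, 1, 0]
print(vectorize(docs["B"]))       # [1, 0, 0, 0, 1]
print(vectorize("dog and frog"))  # [0, 1, 0, 1, 1]
```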

Page 54

Similarity Measures

• There are many different ways to measure how similar two documents are, or how similar a document is to a query
• The cosine measure is a very common similarity measure
• Using a similarity measure, a set of documents can be compared to a query and the most similar documents returned

Page 55

The Cosine Measure

• For two vectors d and d’, the cosine similarity between d and d’ is given by:

sim(d, d’) = (d · d’) / (|d| |d’|)

• Here d · d’ is the dot product of d and d’, calculated by multiplying corresponding frequencies together and summing the results
• The cosine measure is the cosine of the angle between the vectors in a high-dimensional space

Page 56

Example

• Let d = (2,1,1,1,0) and d’ = (0,0,0,1,0)
– d · d’ = 2·0 + 1·0 + 1·0 + 1·1 + 0·0 = 1
– |d| = √(2² + 1² + 1² + 1² + 0²) = √7 ≈ 2.646
– |d’| = √(0² + 0² + 0² + 1² + 0²) = √1 = 1
– Similarity = 1 / (1 × 2.646) ≈ 0.378
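
A minimal check of the computation above:

```python
import math

def cosine(d1, d2):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

print(cosine([2, 1, 1, 1, 0], [0, 0, 0, 1, 0]))  # 0.3779...
```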

Page 57

Ranking Documents

• A user enters a query
• The query is compared to all documents using a similarity measure
• The user is shown the documents in decreasing order of similarity to the query

Page 58

Vocabulary

• Stopword lists
– Commonly occurring words are unlikely to give useful information and may be removed from the vocabulary to speed up processing
• Examples: a, and, to, is, of, in, if, would, very, when, you, …
– Stopword lists contain frequent words to be excluded
– Stopword lists need to be used carefully
• E.g., “to be or not to be”

Page 59

Stemming

• Suppose that a user is interested in finding pages about “running shoes”
• In many cases it is desirable to also return pages containing “shoe” instead of “shoes”, and pages containing “run” or “runs” instead of “running”
• To accommodate such variations, a stemmer is used

Page 60

Stemming (continued)

• A stemmer receives a keyword as input and returns its stem (or normal form)
• For example, the stem of “running” might be “run”
• Instead of checking whether a word w appears in a page P, a search engine might check whether there is a word w' in P that has the same stem as w, i.e., stem(w) = stem(w') (a toy stemmer is sketched below)
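
A toy suffix-stripping stemmer, purely to make the idea concrete; real engines use proper algorithms such as the Porter stemmer rather than anything this crude:

```python
def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def appears_by_stem(w, page_words):
    """Does some word w' in the page satisfy stem(w) == stem(w')?"""
    return any(stem(w) == stem(w2) for w2 in page_words)

print(stem("running"), stem("shoes"))                      # run shoe
print(appears_by_stem("running", ["he", "runs", "fast"]))  # True
```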

Page 61

Term Weighting

• Not all words are equally useful
• A word is most likely to be highly relevant to document A if it is:
– Infrequent in other documents
– Frequent in document A
• The cosine measure needs to be modified to reflect this

Page 62

Normalised Term Frequency (tf)

• A normalised measure of the importance of a word to a document is its frequency, divided by the maximum frequency of any term in the document
• This is known as the tf factor
• Example:
– Given the raw frequency vector (2,1,1,1,0)
– We get the tf vector (1, 1/2, 1/2, 1/2, 0)
• This stops large documents from scoring higher merely because they are large

Page 63

Inverse Document Frequency (idf)

• A calculation designed to make rare words more important than common words
• The idf of a word w is given by:

idf_w = log(N / n_w)

• Here N is the number of documents and n_w is the number of pages that contain the word w

Page 64

tf-idf

• The tf-idf weighting scheme multiplies each word’s weight in each document by its tf factor and its idf factor:
– TF-IDF(P, Q) = Σ_{w ∈ Q} tf(P, w) · idf(w)
• Different schemes are usually used for query vectors
• Different variants of tf-idf are also used (one is sketched below)
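
A minimal sketch of the TF-IDF(P, Q) score above, using the max-frequency tf and the log(N/n_w) idf from the previous slides; the toy corpus is invented:

```python
import math
from collections import Counter

docs = {
    1: ["a", "dog", "and", "a", "cat"],
    2: ["a", "frog"],
    3: ["the", "dog", "runs"],
}
N = len(docs)

def tf(page_words, w):
    """Frequency of w, normalised by the page's maximum term frequency."""
    counts = Counter(page_words)
    return counts[w] / max(counts.values())

def idf(w):
    """log(N / n_w), where n_w is the number of pages containing w."""
    n_w = sum(1 for words in docs.values() if w in words)
    return math.log(N / n_w) if n_w else 0.0

def tf_idf_score(page_words, query):
    return sum(tf(page_words, w) * idf(w) for w in query)

print(tf_idf_score(docs[1], ["dog", "cat"]))  # "cat" is rarer, so it weighs more
```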

Page 65

Traditional Ranking Faults (e.g., TF-IDF)

• Many pages containing a term may be of poor quality or not relevant
• People put popular words in irrelevant sites to promote the sites
• Queries are short, so containing the words of a query does not necessarily indicate importance

Page 66

Additional Factors for Ranking

• Links: If an important page links to P, then P must be important
• Words on links: If a page links to P with the query keywords in the link text, then P probably really is about those keywords
• Style of words: If a keyword appears in P in a title, a header, or a large font, it is more important

Page 67

The Hidden Web Challenge

Page 68

The Hidden (Deep) Web

• Web pages that are protected by a password
• Web pages that require filling in a registration form in order to reach them
• Web pages that are dynamically created from data in a database (e.g., search results)
• In a weaker sense:
– Web pages that no other page links to
– Pages that search engines are not allowed to crawl (by robots.txt)

Page 69

One of the Challenges in Archiving the Web

• Can we reach all of the Web by crawling?
• Why do we care about parts that are not reachable by ordinary web crawlers?
• One estimate is that the deep web is 500 times larger than the visible web
• What will be the effect of web services on the ratio between the visible web and the hidden web?