
Page 1

Searching the Web

Representation and Management of Data on the Internet

Page 2

What does a Search Engine do?

• Processes users’ queries
• Finds pages with related information
• Returns a list of resources
• Why can’t we use an ordinary database system that is reachable via an ordinary Web server?
• What are the difficulties in creating a search engine?

Page 3

Motivation

• The web is

– Used by millions

– Contains lots of information

– Link based

– Incoherent

– Changes rapidly

– Distributed

• Traditional information retrieval was built with the exact opposite in mind

Page 4

The Web’s Characteristics

• Size

– Over a billion pages available (Google is a spelling of googol = 10^100)
– 5-10 KB per page => tens of terabytes
– Size doubles every 2 years
• Change
– 23% of pages change daily
– About half of the pages no longer exist after 10 days
– Bowtie structure

Page 5

Bowtie Structure

[diagram] The web’s link graph looks like a bowtie:
– Core: strongly connected component (28%)
– Pages that reach the core (22%)
– Pages reachable from the core (22%)

Page 6

Search Engine Components

• User Interface

• Query processor

• Crawler

• Indexer

• Ranker

Page 7

An HTML form for inserting a search query

Usually a query is a list of words

What was the most popular query in Google in the last year?

What does it mean to be popular in Google?

Page 8

Page 9

Page 10

Crawling the Web

Page 11

Basic Crawler (Spider)

[diagram] A queue of pages feeds the crawl loop: removeBestPage( ), then findLinksInPage( ), then insertIntoQueue( ).

A crawler finds Web pages to download into a search engine cache.
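
A minimal sketch of this loop in Python, assuming a priority queue ordered by an importance score; fetch_page, extract_links, and score are hypothetical stand-ins for real HTTP, HTML-parsing, and metric code:

```python
import heapq

def crawl(seed_urls, fetch_page, extract_links, score, max_pages=1000):
    """Repeatedly remove the best page, download it, and queue its links."""
    # heapq pops the smallest item, so store negated scores to pop the
    # most important page first.
    queue = [(-score(url), url) for url in seed_urls]
    heapq.heapify(queue)
    cache, seen = {}, set(seed_urls)
    while queue and len(cache) < max_pages:
        _, url = heapq.heappop(queue)            # removeBestPage()
        page = fetch_page(url)                   # download into the cache
        cache[url] = page
        for link in extract_links(page):         # findLinksInPage()
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (-score(link), link))  # insertIntoQueue()
    return cache
```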

Page 12

Choosing Pages to Download

• Q: Which pages should be downloaded?
• A: It is usually not possible to download all pages, because of space limitations; try to get the most important pages
• Q: When is a page important?
• A: Use a metric: by interest, by popularity, by location, or a combination

Page 13

Interest Driven

• Suppose that there is a query Q containing the words we are interested in
• Define the importance of a page P by its textual similarity to the query Q
• Example: use a formula that combines:
– The number of appearances of words from Q in P
– For each word of Q, how frequently it is used overall (why is this important?)
• Problem: We must decide whether a page is important while crawling. However, we don’t know how rare a word is until the crawl is complete
• Solution: Use an estimate (sketched below)
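
One possible estimate, assuming we keep running word statistics over the pages crawled so far and use them to approximate word rarity; all names here are illustrative:

```python
import math
from collections import Counter

crawled_word_counts = Counter()   # pages seen so far that contain each word
pages_crawled = 0

def record_page(words):
    """Update the running statistics after downloading a page."""
    global pages_crawled
    crawled_word_counts.update(set(words))
    pages_crawled += 1

def importance(page_words, query_words):
    """Occurrences of each query word, weighted by its estimated rarity."""
    counts = Counter(page_words)
    score = 0.0
    for w in query_words:
        docs_with_w = crawled_word_counts[w] + 1          # +1 avoids log(0)
        rarity = math.log((pages_crawled + 1) / docs_with_w)
        score += counts[w] * rarity
    return score
```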

Page 14

Popularity Driven

• The importance of a page P is proportional to the number of pages with a link to P
• This is also called the number of backlinks of P
• As before, we need to estimate this amount
• There is a more sophisticated metric, called PageRank (taught on Tuesday)

Page 15

Location Driven

• The importance of P is a function of its URL
• Examples:
– Words appearing in the URL (e.g., edu or ac)
– Number of “/” characters in the URL
• Easily evaluated; requires no data from previous crawls
• Note: We can also use a combination of all three metrics (a location score is sketched below)
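
A minimal sketch of a location-driven score, assuming (purely for illustration) that academic domains score higher and that URLs closer to the site root score higher; the weights are invented:

```python
from urllib.parse import urlparse

def location_score(url):
    """Score a URL by its text alone; no data from previous crawls needed."""
    parsed = urlparse(url)
    score = 0.0
    # Words appearing in the URL, e.g. academic domain labels.
    if any(label in ("edu", "ac") for label in parsed.netloc.split(".")):
        score += 1.0
    # Fewer "/" separators => closer to the site root.
    depth = parsed.path.count("/")
    score += 1.0 / (1 + depth)
    return score

print(location_score("http://www.cs.example.edu/index.html"))  # edu, shallow
print(location_score("http://example.com/a/b/c/d/page.html"))  # deep path
```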

Page 16

Refreshing Web Pages

• Pages that have been downloaded must be refreshed periodically
• Q: Which pages should be refreshed?
• Q: How often should we refresh a page?

Page 17

Freshness Metric

• A cached page is fresh if it is identical to the version on the Web
• Suppose that S is a set of pages (i.e., a cache):

Freshness(S) = (number of fresh pages in S) / (number of pages in S)

Page 18

Age Metric

• The age of a page is the number of days since it was last refreshed
• Suppose that S is a set of pages (i.e., a cache):

Age(S) = average age of the pages in S
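
A minimal sketch computing both metrics over a toy cache; the CachedPage fields and the data are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class CachedPage:
    cached_version: str   # the copy we hold
    live_version: str     # the version currently on the Web
    age_days: float       # days since the page was last refreshed

def freshness(cache):
    """Fraction of cached pages identical to their live versions."""
    fresh = sum(1 for p in cache if p.cached_version == p.live_version)
    return fresh / len(cache)

def age(cache):
    """Average age of the pages in the cache."""
    return sum(p.age_days for p in cache) / len(cache)

cache = [CachedPage("v1", "v1", 2.0), CachedPage("v1", "v2", 9.0)]
print(freshness(cache))  # 0.5
print(age(cache))        # 5.5
```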

Page 19

Refresh Goal

• Crawlers can refresh only a certain number of pages in a given period of time
• The page-download resource can be allocated in many ways
• Goal: Minimize the age of the cache and maximize its freshness
• We need a refresh strategy

Page 20

Refresh Strategies

• Uniform refresh: The crawler revisits all pages with the same frequency, regardless of how often they change
• Proportional refresh: The crawler revisits a page with frequency proportional to the page’s change rate (i.e., if it changes more often, we visit it more often)

Which do you think is better?

Page 21

Trick Question

• Two-page database

• e1 changes daily

• e2 changes once a week

• Can visit one page per week

• How should we visit pages?

– e1 e2 e1 e2 e1 e2 e1 e2... [uniform]

– e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 … [proportional]

– e1 e1 e1 e1 e1 e1 ...

– e2 e2 e2 e2 e2 e2 ...

– ?

[diagram: pages e1 and e2 on the web, and their copies in the database]

Page 22

Proportional Often Not Good!

• Visiting fast-changing e1 gains 1/2 day of freshness
• Visiting slow-changing e2 gains 1/2 week of freshness
• Visiting e2 is a better deal!

Page 23

Another Example

• The collection contains 2 pages: e1 changes 9 times a day, e2 changes once a day
• Simplified change model:
– The day is split into 9 equal intervals: e1 changes once in each interval, and e2 changes once during the day
– We don’t know when the pages change within the intervals
• The crawler can download one page a day
• Our goal is to maximize freshness

Page 24

Which Page Do We Refresh?

• Suppose we refresh e2 at midday
• If e2 changes in the first half of the day, it remains fresh for the rest (half) of the day
– 50% chance of a 0.5-day freshness increase
– 50% chance of no increase
– Expected freshness increase: 0.25 day

Page 25

Which Page Do We Refresh?

• Suppose we refresh e1 at midday
• If e1 changes in the first half of its interval, and we refresh at midday (the middle of an interval), it remains fresh for the remaining half of the interval = 1/18 of a day
– 50% chance of a 1/18-day freshness increase
– 50% chance of no increase
– Expected freshness increase: 1/36 day
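
A short check of the two expectations, mirroring the slides’ numbers:

```python
# e2 changes once a day: with probability 1/2 the change happened before
# midday, and the refreshed copy then stays fresh for the remaining 1/2 day.
gain_e2 = 0.5 * 0.5          # expected gain: 0.25 day

# e1 changes once per 1/9-day interval: a midday refresh lands mid-interval,
# so it buys at most the remaining half interval = 1/18 day.
gain_e1 = 0.5 * (1 / 18)     # expected gain: 1/36 day

print(gain_e2, gain_e1)      # 0.25 vs ~0.028: refreshing e2 is the better deal
```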

Page 26

Not Every Page is Equal!

• Suppose that e1 is accessed twice as often as e2
• Then it is twice as important to us that e1 is fresh as it is that e2 is fresh

Page 27

Politeness Issues

• When a crawler crawls a site, it uses the site’s resources:
– The web server needs to find the file in the file system
– The web server needs to send the file over the network
• If a crawler asks for many of the pages at a high speed, it may
– crash the site’s web server, or
– be banned from the site
• Solution: Ask for pages “slowly”

Page 28

Politeness Issues (cont)

• A site may identify pages that it doesn’t want to be crawled (how?)
• A polite crawler will not crawl these pages (although nothing stops a crawler from being impolite)
• Put a file called robots.txt in the site’s main directory to identify pages that should not be crawled (e.g., http://www.cnn.com/robots.txt)

Page 29

robots.txt

• Use the User-Agent header to identify programs whose access should be restricted
• Use the Disallow header to identify pages that should be restricted (an example appears below)
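
A minimal sketch using Python’s standard urllib.robotparser to interpret a hypothetical robots.txt; the file contents and URLs are made up:

```python
from urllib import robotparser

# A hypothetical robots.txt: ban one crawler entirely, and keep
# every other crawler out of /private/.
robots_txt = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("BadBot", "http://example.com/index.html"))     # False
print(rp.can_fetch("MyCrawler", "http://example.com/private/a"))   # False
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))  # True
```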

Page 30

Other Issues

• Suppose that a search engine uses several crawlers at the same time (in parallel)
• How can we make sure that they are not doing the same work (i.e., visiting the same pages)?

Page 31

Index Repository

Page 32

Storage Challenges

• Scalability: Should be able to store huge amounts of data (data spans disks or computers)
• Dual access mode: Random access (find specific pages) and streaming access (find large subsets of pages)
• Large batch updates: Reclaim old space, avoid access/update conflicts
• Obsolete pages: Remove pages no longer on the web (how do we find these pages?)

Page 33

Storage Challenges

• Storage cost: Should be able to store the huge amounts of data at a reasonable cost (a disk that can store a few terabytes is very expensive, so what do search engines such as Google do?)

Page 34

Update Strategies

• Updates are generated by the crawler
• Several characteristics:
– The time at which the crawl occurs and the repository receives information
– Whether the crawl’s information replaces the entire database or modifies parts of it

Page 35

Batch Crawler vs. Steady Crawler

• Batch mode
– Periodically executed
– Allocated a certain amount of time
• Steady mode
– Runs all the time
– Always sends results back to the repository

Page 36

Partial vs. Complete Crawls

• A batch-mode crawler can do either
– A complete crawl every run, replacing the entire cache
– A partial crawl, replacing only a subset of the cache
• The repository can implement
– In-place update: Replaces the data in the cache, thus refreshing pages quickly
– Shadowing: Creates a new index with the updates and later replaces the previous one, thus avoiding refresh-access conflicts

Page 37

Partial vs. Complete Crawls

• Shadowing resolves the conflicts between updates and reads for queries
• Batch mode fits well with shadowing
• A steady crawler fits well with in-place updates

Page 38

Types of Indices

• Content index: Allows us to easily find pages containing certain words
• Links index: Allows us to easily find links between pages
• Utility index: Allows us to easily find pages in a certain domain, of a certain type, etc.
• Q: What do we need these for?

Page 39

Is the Following Content Index Good?

• Consider a table with one row per word occurrence:

Word | Frequency | UrlId
...  | ...       | ...

• We want to quickly find pages with a specific word
• Is this a good way of storing a content index?

Page 40

Is the Following Content Index Good? NO

• If a word appears in a thousand documents, then the word will appear in a thousand rows. Why waste the space?
• If a word appears in a thousand documents, we will have to access a thousand rows in order to find the documents
• It does not easily support queries that require multiple words

Page 41

Inverted Keyword Index

bush: (1, 5, 11, 17)
saddam: (3, 5, 11, 17)
war: (3, 5, 17, 28)
butterfly: (4, 22)

A hash table: the words are the keys, and the values are lists of matching documents, sorted by urlId.

Page 42

Query: “bush saddam war”

bush: (1, 5, 11, 17)
saddam: (3, 5, 11, 17)
war: (3, 5, 17, 28)

Answers: 5, 17

Algorithm: always advance the pointer(s) with the lowest urlId (sketched below)
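
A minimal sketch of this merge: one pointer per posting list; when all pointers agree on a urlId, that urlId matches every query word:

```python
def intersect(postings):
    """Intersect sorted posting lists by always advancing the lowest pointers."""
    pointers = [0] * len(postings)
    answers = []
    while all(p < len(lst) for p, lst in zip(pointers, postings)):
        current = [lst[p] for p, lst in zip(pointers, postings)]
        if len(set(current)) == 1:        # all lists agree: a match
            answers.append(current[0])
            pointers = [p + 1 for p in pointers]
        else:                             # advance the pointer(s) at the minimum
            lowest = min(current)
            pointers = [p + (c == lowest) for p, c in zip(pointers, current)]
    return answers

# Query "bush saddam war" over the lists above:
print(intersect([[1, 5, 11, 17], [3, 5, 11, 17], [3, 5, 17, 28]]))  # [5, 17]
```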

Page 43

Challenges

• Index building must be:
– Fast
– Economical
• Incremental indexing must be supported
• Tradeoff when using compression: memory is saved, but time is lost compressing and uncompressing

Page 44

How do we Distribute the Indices Between Files?

• Local inverted file
– Each file holds the full index for a disjoint subset of the pages
– The query is broadcast to all files
– The result is the merge of the answers from each file
• Global inverted file
– Each file is responsible for a subset of the terms in the collection
– The query is “sent” only to the appropriate files
• What happens if a disk crashes (which scheme is better in this case)?

Page 45

Ranking

Page 46

A Naïve Approach

• Let Q (the query) be a set of words
• Let countQ(P) be the number of occurrences of words of Q in P
• A naïve approach:
– If countQ(P1) > countQ(P2), then P1 should be ranked higher than P2
• What are the problems with the naïve approach?

Page 47

Testing the Naïve Approach

• Q = “green men mars”
– P1 = “I live in a green house with a green roof”
– P2 = “There is no life form on Mars”
– P3 = “Men don’t like green cars”
– P4 = “I saw some little green men yesterday”
• In what order do you think these ‘pages’ should be returned?

Page 48

The Vector Space Model

• The Vector Space Model (VSM) is a way of representing documents through the words that they contain
• It is a standard technique in Information Retrieval
• The VSM allows decisions to be made about which documents are similar to each other and to keyword queries

Page 49

How Does it Work?

• Each document is broken down into a word frequency table
• The tables are called vectors and can be stored as arrays
• A vocabulary is built from all the words in all the documents in the system
• Each document is represented as a vector against the vocabulary

Page 50

Example

• Document A: “A dog and a cat.”

a | dog | and | cat
2 | 1   | 1   | 1

• Document B: “A frog.”

a | frog
1 | 1

Page 51

Example (continued)

• The vocabulary contains all the words that are used:
– a, dog, and, cat, frog
• The vocabulary is sorted:
– a, and, cat, dog, frog

Page 52

Example (continued)

• Document A: “A dog and a cat.”

– Vector: (2,1,1,1,0)

• Document B: “A frog.”

– Vector: (1,0,0,0,1)

a | and | cat | dog | frog
2 | 1   | 1   | 1   | 0

a | and | cat | dog | frog
1 | 0   | 0   | 0   | 1

Page 53

Queries

• Queries can be represented as vectors in the same way as documents:
– “dog” = (0,0,0,1,0)
– “frog” = (0,0,0,0,1)
– “dog and frog” = (0,1,0,1,1)
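
A minimal sketch of the whole pipeline on this toy example, assuming whitespace tokenization and lowercasing; it reproduces the vectors above:

```python
from collections import Counter

def tokenize(text):
    """Lowercase, drop periods, split on whitespace."""
    return text.lower().replace(".", "").split()

docs = {"A": "A dog and a cat.", "B": "A frog."}

# The vocabulary: all words in all documents, sorted.
vocabulary = sorted({w for text in docs.values() for w in tokenize(text)})

def vectorize(text):
    """Represent a document (or a query) against the vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocabulary]

print(vocabulary)                 # ['a', 'and', 'cat', 'dog', 'frog']
print(vectorize(docs["A"]))       # [2, 1, 1, 1, 0]
print(vectorize(docs["B"]))       # [1, 0, 0, 0, 1]
print(vectorize("dog and frog"))  # [0, 1, 0, 1, 1]
```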

Page 54

Similarity Measures

• There are many different ways to measure how similar two documents are, or how similar a document is to a query
• The cosine measure is a very common similarity measure
• Using a similarity measure, a set of documents can be compared to a query and the most similar documents returned

Page 55

The Cosine Measure

• For two vectors d and d’, the cosine similarity between d and d’ is given by:

sim(d, d’) = (d · d’) / (|d| |d’|)

• Here d · d’ is the dot product of d and d’, calculated by multiplying corresponding frequencies together and summing the results
• The cosine measure is the cosine of the angle between the vectors in a high-dimensional space

Page 56

Example

• Let d = (2,1,1,1,0) and d’ = (0,0,0,1,0)
– d · d’ = 2·0 + 1·0 + 1·0 + 1·1 + 0·0 = 1
– |d| = √(2² + 1² + 1² + 1² + 0²) = √7 ≈ 2.646
– |d’| = √(0² + 0² + 0² + 1² + 0²) = √1 = 1
– Similarity = 1 / (1 × 2.646) ≈ 0.378
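
A minimal check of the computation above:

```python
import math

def cosine(d1, d2):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

print(cosine([2, 1, 1, 1, 0], [0, 0, 0, 1, 0]))  # 0.3779...
```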

Page 57

Ranking Documents

• A user enters a query
• The query is compared to all documents using a similarity measure
• The user is shown the documents in decreasing order of similarity to the query

Page 58

Vocabulary

• Stopword lists
– Commonly occurring words are unlikely to give useful information and may be removed from the vocabulary to speed up processing
• Examples: a, and, to, is, of, in, if, would, very, when, you, …
– Stopword lists contain frequent words to be excluded
– Stopword lists need to be used carefully
• E.g., “to be or not to be”

Page 59

Stemming

• Suppose that a user is interested in finding pages about “running shoes”
• In many cases it is desirable to also return pages containing “shoe” instead of “shoes”, and pages containing “run” or “runs” instead of “running”
• To accommodate such variations, a stemmer is used

Page 60

Stemming (continued)

• A stemmer receives a keyword as input and returns its stem (or normal form)
• For example, the stem of “running” might be “run”
• Instead of checking whether a word w appears in a page P, a search engine might check whether there is a word w' in P that has the same stem as w, i.e., stem(w) = stem(w') (a toy stemmer is sketched below)
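
A toy suffix-stripping stemmer, purely to make the idea concrete; real engines use proper algorithms such as the Porter stemmer rather than anything this crude:

```python
def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def appears_by_stem(w, page_words):
    """Does some word w' in the page satisfy stem(w) == stem(w')?"""
    return any(stem(w) == stem(w2) for w2 in page_words)

print(stem("running"), stem("shoes"))                      # run shoe
print(appears_by_stem("running", ["he", "runs", "fast"]))  # True
```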

Page 61

Term Weighting

• Not all words are equally useful
• A word is most likely to be highly relevant to document A if it is:
– Infrequent in other documents
– Frequent in document A
• The cosine measure needs to be modified to reflect this

Page 62

Normalised Term Frequency (tf)

• A normalised measure of the importance of a word to a document is its frequency, divided by the maximum frequency of any term in the document
• This is known as the tf factor
• Example:
– Given the raw frequency vector (2,1,1,1,0)
– We get the tf vector (1, 1/2, 1/2, 1/2, 0)
• This stops large documents from scoring higher merely because they are large

Page 63

Inverse Document Frequency (idf)

• A calculation designed to make rare words more important than common words
• The idf of a word w is given by:

idf_w = log(N / n_w)

• Here N is the number of documents and n_w is the number of pages that contain the word w

Page 64

tf-idf

• The tf-idf weighting scheme multiplies each word’s weight in each document by its tf factor and its idf factor:
– TF-IDF(P, Q) = Σ_{w ∈ Q} tf(P, w) · idf(w)
• Different schemes are usually used for query vectors
• Different variants of tf-idf are also used (one is sketched below)
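
A minimal sketch of the TF-IDF(P, Q) score above, using the max-frequency tf and the log(N/n_w) idf from the previous slides; the toy corpus is invented:

```python
import math
from collections import Counter

docs = {
    1: ["a", "dog", "and", "a", "cat"],
    2: ["a", "frog"],
    3: ["the", "dog", "runs"],
}
N = len(docs)

def tf(page_words, w):
    """Frequency of w, normalised by the page's maximum term frequency."""
    counts = Counter(page_words)
    return counts[w] / max(counts.values())

def idf(w):
    """log(N / n_w), where n_w is the number of pages containing w."""
    n_w = sum(1 for words in docs.values() if w in words)
    return math.log(N / n_w) if n_w else 0.0

def tf_idf_score(page_words, query):
    return sum(tf(page_words, w) * idf(w) for w in query)

print(tf_idf_score(docs[1], ["dog", "cat"]))  # "cat" is rarer, so it weighs more
```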

Page 65

Traditional Ranking Faults (e.g., TF-IDF)

• Many pages containing a term may be of poor quality or not relevant
• People put popular words in irrelevant sites to promote the sites
• Queries are short, so containing the words of a query does not necessarily indicate importance

Page 66

Additional Factors for Ranking

• Links: If an important page links to P, then P must be important
• Words on links: If a page links to P with the query keywords in the link text, then P probably really is about those keywords
• Style of words: If a keyword appears in P in a title, a header, or a large font, it is more important

Page 67

The Hidden Web Challenge

Page 68

The Hidden (Deep) Web

• Web pages that are protected by a password
• Web pages that require filling in a registration form in order to reach them
• Web pages that are dynamically created from data in a database (e.g., search results)
• In a weaker sense:
– Web pages that no other page links to
– Pages that search engines are not allowed to crawl (by robots.txt)

Page 69

One of the Challenges in Archiving the Web

• Can we reach all of the Web by crawling?
• Why do we care about parts that are not reachable by ordinary web crawlers?
• One estimate is that the deep web is 500 times larger than the visible web
• What will be the effect of web services on the ratio between the visible web and the hidden web?