Searching the Web: Representation and Management of Data on the Internet
Posted on 21-Dec-2015
What does a Search Engine do?
• Processes users' queries
• Finds pages with related information
• Returns a list of resources
• Why can’t we use an ordinary database system
that is reachable via an ordinary Web server?
• What are the difficulties in creating a search
engine?
Motivation
• The web is
– Used by millions
– Contains lots of information
– Link based
– Incoherent
– Changes rapidly
– Distributed
• Traditional information retrieval was built with the
exact opposite in mind
The Web’s Characteristics
• Size
– Over a billion pages available (Google is a spelling of
googol = 10^100)
– 5-10K per page => tens of terabytes
– Size doubles every 2 years
• Change
– 23% change daily
– About half of the pages do not exist after 10 days
– Bowtie structure
Bowtie Structure
Core: Strongly
connected component
(28%)
Reachable from the core (22%)
Reaches the core (22%)
Search Engine Components
• User Interface
• Query processor
• Crawler
• Indexer
• Ranker
An HTML form for inserting a search query
Usually a query is a list of words
What was the most popular query in Google in the last year?
What does it mean to be popular in Google?
Crawling the Web
Basic Crawler (Spider)
Queue of Pages
removeBestPage( )
findLinksInPage( )
insertIntoQueue( )
A crawler finds Web
pages to download
into a search engine
cache
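The loop sketched on this slide can be written out in Python. This is a minimal sketch, not any particular engine's crawler; `download`, `find_links_in_page`, and `importance` are assumed helper functions supplied by the caller, and the queue is a priority queue ordered by estimated importance.

```python
import heapq

def crawl(seed_urls, download, find_links_in_page, importance, limit=100):
    """Basic crawler loop: repeatedly remove the 'best' page from the queue,
    download it into the cache, and enqueue its outgoing links."""
    # Max-heap keyed on estimated importance (heapq is a min-heap, so negate).
    queue = [(-importance(u), u) for u in seed_urls]
    heapq.heapify(queue)
    cache, seen = {}, set(seed_urls)
    while queue and len(cache) < limit:
        _, url = heapq.heappop(queue)                 # removeBestPage()
        page = download(url)
        cache[url] = page                             # store in the search engine cache
        for link in find_links_in_page(page):         # findLinksInPage()
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (-importance(link), link))  # insertIntoQueue()
    return cache
```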
Choosing Pages to Download
• Q: Which pages should be downloaded?
• A: It is usually not possible to download all
pages because of space limitations. Try to
get the most important pages
• Q: When is a page important?
• A: Use a metric – by interest, by popularity,
by location, or a combination of these
Interest Driven
• Suppose that there is a query Q that contains the words we
will be interested in
• Define the importance of a page P by its textual similarity to
the query Q
• Example: use a formula that combines:
– The number of appearances of words from Q in P
– For each word of Q, how frequently it is used (why is this
important?)
• Problem: We must decide if a page is important while
crawling. However, we don't know how rare a word is until the
crawl is complete
• Solution: Use an estimate
Popularity Driven
• The importance of a page P is proportional
to the number of pages with a link to P
• This is also called the number of back links
of P
• As before, need to estimate this amount
• There is a more sophisticated metric, called
PageRank (was taught on Tuesday)
Location Driven
• The importance of P is a function of its URL
• Example:
– Words appearing in the URL (e.g., edu or ac)
– Number of "/" in the URL
• Easily evaluated, requires no data from previous
crawls
• Note: We can also use a combination of all three
metrics
Refreshing Web Pages
• Pages that have been downloaded must be
refreshed periodically
• Q: Which pages should be refreshed?
• Q: How often should we refresh a page?
Freshness Metric
• A cached page is fresh if it is identical to the
version on the Web
• Suppose that S is a set of pages (i.e., a
cache)
Freshness(S) = (number of fresh pages in S) / (number of pages in S)
Age Metric
• The age of a page is the number of days
since it was refreshed
• Suppose that S is a set of pages (i.e., a
cache)
Age(S) = Average age of pages in S
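Both metrics are easy to state in code. A minimal sketch, assuming the cache and the live web are represented as dictionaries from URL to content, and refresh times as dates:

```python
from datetime import date

def freshness(cache, live_web):
    # Freshness(S) = (number of fresh pages in S) / (number of pages in S);
    # a cached page is fresh if it is identical to the version on the web.
    fresh = sum(1 for url, page in cache.items() if live_web.get(url) == page)
    return fresh / len(cache)

def age(last_refreshed, today):
    # Age(S) = average number of days since each page in S was refreshed.
    return sum((today - d).days for d in last_refreshed.values()) / len(last_refreshed)
```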
Refresh Goal
• Crawlers can refresh only a certain number
of pages in a period of time
• The page download resource can be
allocated in many ways
• Goal: Minimize the age of a cache and
maximize the freshness of a cache
• We need a refresh strategy
Refresh Strategies
• Uniform Refresh: The crawler revisits all pages
with the same frequency, regardless of how often
they change
• Proportional Refresh: The crawler revisits a page
with frequency proportional to the page’s change
rate (i.e., if it changes more often, we visit it more
often)
Which do you think is better?
Trick Question
• Two-page database
• e1 changes daily
• e2 changes once a week
• Can visit one page per week
• How should we visit pages?
– e1 e2 e1 e2 e1 e2 e1 e2... [uniform]
– e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 … [proportional]
– e1 e1 e1 e1 e1 e1 ...
– e2 e2 e2 e2 e2 e2 ...
– ?
[Diagram: pages e1 and e2 on the web, mirrored in the crawler's database]
Proportional Often Not Good!
• Visit fast-changing e1:
gain 1/2 day of freshness
• Visit slow-changing e2:
gain 1/2 week of freshness
• Visiting e2 is a better deal!
Another Example
• The collection contains 2 pages: e1 changes 9
times a day, e2 changes once a day
• Simplified change model:
– Day is split into 9 equal intervals: e1 changes once on
each interval, and e2 changes once during the day
– Don’t know when the pages change within the intervals
• The crawler can download a page a day
• Our goal is to maximize the freshness
Which Page Do We Refresh?
• Suppose we refresh e2 at midday
• If e2 changes in the first half of the day, it
remains fresh for the rest (half) of the day
– 50% chance of a 0.5-day freshness increase
– 50% chance of no increase
– Expected freshness increase: 0.25 days
Which Page Do We Refresh?
• Suppose we refresh e1 at midday
• If e1 changes in the first half of its interval, and we
refresh at midday (which is the middle of that
interval), it remains fresh for the remaining half of the
interval = 1/18 of a day
– 50% chance of a 1/18-day freshness increase
– 50% chance of no increase
– Expected freshness increase: 1/36 of a day
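The two expectations above come from the same one-line computation. A sketch, under the slide's simplified model (one change per interval at an unknown time, refresh in the middle of an interval):

```python
def expected_gain(changes_per_day):
    # The page changes once per interval of length 1/changes_per_day days.
    # With probability 1/2 the change falls in the first half of the interval,
    # and the mid-interval refresh then stays fresh for the remaining half.
    interval = 1.0 / changes_per_day
    return 0.5 * (interval / 2)
```

For e2 (one change a day) this gives 0.25 days; for e1 (nine changes a day) only 1/36 of a day, which is why refreshing e2 is the better deal.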
Not Every Page is Equal!
• Suppose that e1 is accessed twice as often
as e2
• Then, it is twice as important to us that e1 is
fresh as it is that e2 is fresh
Politeness Issues
• When a crawler crawls a site, it uses the site’s
resources:
– The web server needs to find the file in its file system
– The web server needs to send the file over the network
• If a crawler asks for many of the pages at a
high speed, it may
– crash the site's web server or
– be banned from the site
• Solution: Ask for pages “slowly”
Politeness Issues (cont)
• A site may identify pages that it doesn’t want to be
crawled (how?)
• A polite crawler will not crawl these pages (although
nothing stops a crawler from being impolite)
• Put a file called robots.txt at the main directory to
identify pages that should not be crawled (e.g.,
http://www.cnn.com/robots.txt)
robots.txt
• Use the User-Agent field to identify the
programs whose access should be restricted
• Use the Disallow field to identify the paths
that should not be crawled
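A small robots.txt in this format might look as follows (the paths and the bot name are made up for illustration):

```text
User-Agent: *
Disallow: /private/
Disallow: /cgi-bin/

User-Agent: BadBot
Disallow: /
```

The first record restricts all crawlers from the listed paths; the second bans the hypothetical BadBot from the entire site.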
Other Issues
• Suppose that a search engine uses several
crawlers at the same time (in parallel)
• How can we make sure that they are not
doing the same work (i.e., visiting the same
pages)?
Index Repository
Storage Challenges
• Scalability: Should be able to store huge amounts
of data (data spans disks or computers)
• Dual Access Mode: Random access (find specific
pages) and Streaming access (find large subsets
of pages)
• Large Batch Updates: Reclaim old space, avoid
access/update conflicts
• Obsolete Pages: Remove pages no longer on the
web (how do we find these pages?)
Storage Challenges
• Storage cost: Should be able to store the
huge amounts of data at a reasonable cost
(a disk that can store a few terabytes is very
expensive, so what do search engines such
as Google do?)
Update Strategies
• Updates are generated by the crawler
• Several characteristics
– Time in which the crawl occurs and the
repository receives information
– Whether the crawl’s information replaces the
entire database or modifies parts of it
Batch Crawler vs. Steady Crawler
• Batch mode
– Periodically executed
– Allocated a certain amount of time
• Steady mode
– Runs all the time
– Continuously sends results back to the repository
Partial vs. Complete Crawls
• A batch mode crawler can either do
– A complete crawl every run, and replace entire cache
– A partial crawl and replace only a subset of the cache
• The repository can implement
– In-place update: replaces the data in the cache, thus
quickly refreshing pages
– Shadowing: creates a new index with the updates, and later
replaces the previous one, thus avoiding refresh-access
conflicts
Partial vs. Complete Crawls
• Shadowing resolves the conflicts between
updates and reads for queries
• Batch mode pairs well with shadowing
• A steady crawler pairs well with in-place updates
Types of Indices
• Content index: allows us to easily find pages
with certain words
• Links index: allows us to easily find links
between pages
• Utility index: allows us to easily find pages in
a certain domain, or of a certain type, etc.
• Q: What do we need these for?
Is the Following Content Index Good?
• Consider the table:
• We want to quickly find pages with a specific word
• Is this a good way of storing a content index?
| Word | Frequency | UrlId |
| --- | --- | --- |
| ... | ... | ... |
Is the Following Content Index Good? NO
• If a word appears in a thousand documents, then
the word will be in a thousand rows. Why waste the
space?
• If a word appears in a thousand documents, we will
have to access a thousand rows in order to find the
documents
• Does not easily support queries that require
multiple words
Inverted Keyword Index
bush: (1, 5, 11, 17)
saddam: (3, 5, 11, 17)
war: (3, 5, 17, 28)
butterfly: (4, 22)
A hashtable with words as keys and
lists of matching documents (posting lists) as the
values; the lists are sorted by urlId
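Such an index is straightforward to build. A minimal sketch, assuming `pages` maps urlId to page text and that tokenization is a plain whitespace split:

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Build an inverted keyword index: a hashtable mapping each word to a
    sorted list of the urlIds of the pages that contain it."""
    index = defaultdict(set)
    for url_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(url_id)
    # Posting lists are kept sorted by urlId, as the slide requires.
    return {word: sorted(ids) for word, ids in index.items()}
```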
Query: “bush saddam war”
bush: (1, 5, 11, 17)
saddam: (3, 5, 11, 17)
war: (3, 5, 17, 28)
Answers: 5, 17
Algorithm: always advance the pointer(s) with the lowest urlId
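The merge can be sketched as follows: keep one pointer per posting list, report a urlId when all pointers agree on it, and otherwise advance the pointer(s) holding the lowest urlId.

```python
def intersect(postings):
    """Merge sorted posting lists: a urlId is an answer when all pointers
    agree on it; otherwise the lagging pointer(s) are advanced."""
    pointers = [0] * len(postings)
    answers = []
    while all(p < len(lst) for p, lst in zip(pointers, postings)):
        current = [lst[p] for p, lst in zip(pointers, postings)]
        if min(current) == max(current):      # all lists agree on this urlId
            answers.append(current[0])
            pointers = [p + 1 for p in pointers]
        else:                                 # advance pointer(s) with lowest urlId
            lowest = min(current)
            pointers = [p + 1 if c == lowest else p
                        for p, c in zip(pointers, current)]
    return answers
```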
Challenges
• Index building must be:
– Fast
– Economical
• Incremental Indexing must be supported
• Tradeoff when using compression: memory
is saved but time is lost compressing and
uncompressing
How do we Distribute the Indices Between Files?
• Local inverted file
– Each file contains disjoint random pages of the index
– The query is broadcast to all files
– The result is the merge of the per-file query answers
• Global inverted file
– Each file is responsible for a subset of the terms in the collection
– The query is "sent" only to the appropriate files
• What will happen if a disk crashes? (Which scheme is better in
this case?)
Ranking
A Naïve Approach
• Let Q (the query) be a set of words
• Let countQ(P) be the number of occurrences of
words of Q in P
• A naïve approach:
– If countQ(P1) > countQ(P2) then P1 should be ranked
higher than P2
• What are the problems with the naïve approach?
Testing the Naïve Approach
• Q = “green men mars”
– P1 = “I live in a green house with a green roof”
– P2 = “There is no life form on Mars”
– P3 = “Men don’t like green cars”
– P4 = “I saw some little green men yesterday”
• In what order do you think that these ‘pages’
should be returned?
The Vector Space Model
• The Vector Space Model (VSM) is a way of
representing documents through the words that
they contain
• It is a standard technique in Information Retrieval
• The VSM allows decisions to be made about which
documents are similar to each other and to
keyword queries
How Does it Work
• Each document is broken down into a word
frequency table
• The tables are called vectors and can be stored as
arrays
• A vocabulary is built from all the words in all
documents in the system
• Each document is represented as a vector
against the vocabulary
Example
• Document A
– “A dog and a cat.”
• Document B
– “A frog.”
| a | dog | and | cat |
| --- | --- | --- | --- |
| 2 | 1 | 1 | 1 |

| a | frog |
| --- | --- |
| 1 | 1 |
Example (continued)
• The vocabulary contains all the words that
are used:
– a, dog, and, cat, frog
• The vocabulary is sorted
– a, and, cat, dog, frog
Example (continued)
• Document A: “A dog and a cat.”
– Vector: (2,1,1,1,0)
• Document B: “A frog.”
– Vector: (1,0,0,0,1)
| a | and | cat | dog | frog |
| --- | --- | --- | --- | --- |
| 2 | 1 | 1 | 1 | 0 |

| a | and | cat | dog | frog |
| --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 0 | 1 |
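The whole construction (vocabulary, then vectors) fits in a few lines. A sketch, assuming tokenization is lowercasing, dropping periods, and splitting on whitespace:

```python
def to_vectors(documents):
    """Build the sorted vocabulary over all documents and represent each
    document as a word-frequency vector against it."""
    tokenized = [doc.lower().replace(".", "").split() for doc in documents]
    vocabulary = sorted({word for words in tokenized for word in words})
    vectors = [[words.count(term) for term in vocabulary] for words in tokenized]
    return vocabulary, vectors
```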
Queries
• Queries can be represented as vectors in
the same way as documents:
– “dog” = (0,0,0,1,0)
– “frog” = (0,0,0,0,1)
– “dog and frog” = (0,1,0,1,1)
Similarity Measures
• There are many different ways to measure how
similar two documents are, or how similar a
document is to a query
• The cosine measure is a very common similarity
measure
• Using a similarity measure, a set of documents can
be compared to a query and the most similar
document returned
The Cosine Measure
• For two vectors d and d', the cosine similarity
between d and d' is given by:

cos(d, d') = (d · d') / (|d| |d'|)

• Here d · d' is the dot product of d and d',
calculated by multiplying corresponding
frequencies together and summing
• The cosine measure calculates the angle between
the vectors in a high-dimensional virtual space
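The measure translates directly into code:

```python
import math

def cosine(d, d2):
    # cos(d, d') = (d . d') / (|d| |d'|)
    dot = sum(a * b for a, b in zip(d, d2))
    norms = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in d2))
    return dot / norms
```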
Example
• Let d = (2,1,1,1,0) and d' = (0,0,0,1,0)
– d · d' = 2·0 + 1·0 + 1·0 + 1·1 + 0·0 = 1
– |d| = √(2² + 1² + 1² + 1² + 0²) = √7 = 2.646
– |d'| = √(0² + 0² + 0² + 1² + 0²) = √1 = 1
– Similarity = 1/(2.646 × 1) = 0.378
Ranking Documents
• A user enters a query
• The query is compared to all documents
using a similarity measure
• The user is shown the documents in
decreasing order of similarity to the query
term
Vocabulary
• Stopword lists
– Commonly occurring words are unlikely to give useful
information and may be removed from the vocabulary to
speed processing
• Examples: a, and, to, is, of, in, if, would, very, when, you, …
– Stopword lists contain frequent words to be excluded
– Stopword lists need to be used carefully
• E.g. “to be or not to be”
Stemming
• Suppose that a user is interested in finding
pages about “running shoes”
• In many cases it is desirable to return pages
containing shoe instead of shoes, and pages
containing run or runs instead of running
• In order to accommodate such variations, a
stemmer is used
Stemming (continued)
• A stemmer receives a keyword as input, and
returns its stem (or normal form)
• For example, the stem of running might be run
• Instead of checking whether a word w appears in a
page P, a search engine might check if there is a
word w' in P that has the same stem as w, i.e.,
stem(w)=stem(w')
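A toy illustration of the idea (this is NOT a real stemming algorithm such as Porter's; the suffix list below is made up and only covers the slide's examples):

```python
def naive_stem(word):
    """Strip one of a few hard-coded suffixes, keeping a stem of at least
    three letters, so that running, runs -> run and shoes -> shoe."""
    word = word.lower()
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

A search engine would then match a page word w' against a query word w whenever `naive_stem(w) == naive_stem(w')`.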
Term Weighting
• Not all words are equally useful
• A word is most likely to be highly relevant to
document A if it is:
– Infrequent in other documents
– Frequent in document A
• The cosine measure needs to be modified to
reflect this
Normalised Term Frequency (tf)
• A normalised measure of the importance of a word
to a document is its frequency, divided by the
total number of term occurrences in the document
• This is known as the tf factor
• Example:
– Given raw frequency vector: (2,1,1,1,0)
– We get the tf vector: (2/5, 1/5, 1/5, 1/5, 0)
• This stops large documents from scoring higher
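A sketch of the normalisation used in the example (dividing by the total number of term occurrences):

```python
def tf_vector(raw):
    # Divide each raw frequency by the total number of term occurrences,
    # so that long documents do not dominate short ones.
    total = sum(raw)
    return [f / total for f in raw]
```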
Inverse Document Frequency (idf)
• A calculation designed to make rare words more
important than common words
• The idf of word w is given by:

idf_w = log(N / n_w)

• Here N is the number of documents and n_w is the
number of pages that contain the word w
tf-idf
• The tf-idf weighting scheme is to multiply
each word in each document by its tf factor
and idf factor
– TF-IDF(P, Q) = Sum w in Q (tf(P,w)*idf(w))
• Different schemes are usually used for query
vectors
• Different variants of tf-idf are also used
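A minimal sketch of this scheme, assuming each document is given as the set of words it contains and `page_tf` as a word-to-tf mapping (both are illustration-only representations):

```python
import math

def idf_table(pages):
    # idf_w = log(N / n_w): N documents, n_w of them containing w.
    n = len(pages)
    words = {w for p in pages for w in p}
    return {w: math.log(n / sum(1 for p in pages if w in p)) for w in words}

def tf_idf_score(page_tf, query_words, idf):
    # TF-IDF(P, Q) = sum over words w in Q of tf(P, w) * idf(w).
    return sum(page_tf.get(w, 0.0) * idf.get(w, 0.0) for w in query_words)
```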
Traditional Ranking Faults (e.g., TF-IDF)
• Many pages containing a term may be of
poor quality or not relevant
• People put popular words in irrelevant sites
to promote the site
• Queries are short, so containing the words
from a query does not indicate importance
Additional Factors for Ranking
• Links: If an important page links to P, then P must
be important
• Words on links: If a page links to P with the query
keyword in the link text, the page P must really be
about the keywords
• Style of words: If a keyword appears in P in a title,
header, or large font size, it is more important
The Hidden Web Challenge
The Hidden (Deep) Web
• Web pages that are protected by a password
• Web pages that require filling in a registration form in
order to access them
• Web pages that are dynamically created from data
in a database (e.g., search results)
• In a weaker sense:
– Web pages that no other page links to
– Pages that search engines are not allowed to crawl (by
robots.txt)
One of the Challenges in Archiving the Web
• Can we reach all of the Web by crawling?
• Why do we care about parts that are not reachable
by ordinary web crawlers?
• There is an estimate that the deep web is 500 times
larger than the visible web
• What will be the effect of web services on the ratio
between the visible web and the hidden web?