Information Retrieval: Implementation issues

DESCRIPTION

Information Retrieval: Implementation issues. Djoerd Hiemstra & Vojkan Mihajlovic, University of Twente, {d.hiemstra,v.mihajlovic}@utwente.nl. Lecture transcript.
![Page 1: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/1.jpg)
Information Retrieval: Implementation issues
Djoerd Hiemstra & Vojkan Mihajlovic
University of Twente
{d.hiemstra,v.mihajlovic}@utwente.nl
![Page 2: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/2.jpg)
The lecture
• Ian H. Witten, Alistair Moffat, Timothy C. Bell, “Managing Gigabytes”, Morgan Kaufmann, pages 72-115 (Section 3), 1999. (For the exam, the compression methods in Section 3.3, i.e., the part with the grey bar left of the text, do not have to be studied in detail.)
• Sergey Brin and Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Computer Networks and ISDN Systems, 1997.
![Page 3: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/3.jpg)
Overview
• Brute force implementation
• Text analysis
• Indexing
• Index coding and query processing
• Web search engines
• Wrap-up
![Page 4: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/4.jpg)
Overview
• Brute force implementation
• Text analysis
• Indexing
• Index coding and query processing
• Web search engines
• Wrap-up
![Page 5: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/5.jpg)
Architecture 2000
FAST search Engine
Knut Risvik
![Page 6: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/6.jpg)
Architecture today
1. The web server sends the query to the index servers. The content inside the index servers is similar to the index in the back of a book - it tells which pages contain the words that match the query.
2. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result.
3. The search results are returned to the user in a fraction of a second.
![Page 7: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/7.jpg)
Storing the web
• More than 10 billion sites
• Assume each site contains 1000 terms
• Each term consists of 5 chars on average
• Each char takes >= 2 bytes in a UTF encoding
• To store the web you need:
– 10^10 x 10^3 x 5 x 2B ~= 100TB
• What about: term statistics, hypertext info, pointers, search indexes, etc.? ~= PB
• Do we really need all this data?
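The arithmetic behind this estimate can be checked in a few lines (a sketch using the slide's round numbers, not measurements):

```python
# Back-of-the-envelope check of the storage estimate above.
docs = 10**10          # more than 10 billion sites
terms_per_doc = 10**3  # ~1000 terms per site
chars_per_term = 5     # average term length
bytes_per_char = 2     # UTF encoding, >= 2 bytes per character

total_bytes = docs * terms_per_doc * chars_per_term * bytes_per_char
print(total_bytes / 10**12, "TB")  # 100.0 TB
```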
![Page 8: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/8.jpg)
Counting the web
• Text statistics:
– Term frequency
– Collection frequency
– Inverse document frequency …
• Hypertext statistics:
– Ingoing and outgoing links
– Anchor text
– Term positions, proximities, sizes, and characteristics …
![Page 9: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/9.jpg)
Searching the web
• 100TB of data to be searched
• We need to find such a large hard disk (currently the biggest are 250GB)
• Hard disk transfer time: 100MB/s
• Time needed to sequentially scan the data: 1 million seconds
• We have to wait 10 days to get the answer to a query
• That is not all …
![Page 10: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/10.jpg)
Problems in web search
• Web crawling
– Deal with limits, freshness, duplicates, missing links, loops, server problems, virtual hosts, etc.
• Maintain large cluster of servers
– Page servers: store and deliver the results of the queries
– Index servers: resolve the queries
• Answer 250 million user queries per day
– Caching, replicating, parallel processing, etc.
– Indexing, compression, coding, fast access, etc.
![Page 11: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/11.jpg)
Implementation issues
• Analyze the collection
– Avoid non-informative data for indexing
– Decide on relevant statistics and info
• Index the collection
– Which index type to use?
– How to organize the index?
• Compress the data
– Data compression
– Index compression
![Page 12: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/12.jpg)
Overview
• Brute force implementation
• Text analysis
• Indexing
• Index coding and query processing
• Web search engines
• Wrap-up
![Page 13: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/13.jpg)
Term frequency
• Count how many times a term occurs in the collection (size N terms) => frequency (f)
• Order terms by descending frequency => rank (r)
• The product of the frequency of a word and its rank is approximately constant: f x r = C, C ~= N/10
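The rank-frequency relation can be illustrated with a small sketch (the toy text below is an assumption for illustration; f x r is only roughly constant, and C ~= N/10 only emerges for large collections):

```python
from collections import Counter

def rank_frequency(text):
    """Rank terms by descending frequency; return (rank, term, frequency)."""
    freqs = Counter(text.lower().split())
    ordered = sorted(freqs.items(), key=lambda kv: -kv[1])
    return [(r, t, f) for r, (t, f) in enumerate(ordered, start=1)]

text = "the cat sat on the mat and the dog sat on the log"
ranked = rank_frequency(text)
for r, t, f in ranked:
    print(r, t, f, "f x r =", f * r)
```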
![Page 14: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/14.jpg)
Zipf distribution
[Figure: two plots of term count against terms by rank order, on a linear scale (left) and a logarithmic scale (right).]
![Page 15: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/15.jpg)
Consequences
• Few terms occur very frequently: a, an, the, … => non-informative (stop) words
• Many terms occur very infrequently: spelling mistakes, foreign names, … => noise
• Medium number of terms occur with medium frequency => useful
![Page 16: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/16.jpg)
Word resolving power
(van Rijsbergen 79)
![Page 17: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/17.jpg)
Heaps’ law for dictionary size

[Figure: number of unique terms plotted against collection size; the vocabulary grows sublinearly.]
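As a sketch of Heaps' law: the number of unique terms V grows as V(n) ≈ K · n^β for a collection of n term occurrences. The constants below are illustrative values typical for English text, not figures from the lecture:

```python
# Heaps' law sketch. K and beta are collection-dependent; these values
# are assumptions for illustration only.
K, beta = 40, 0.5

def vocabulary_size(n_tokens):
    """Predicted number of unique terms for a collection of n_tokens tokens."""
    return K * n_tokens ** beta

print(vocabulary_size(10**6))  # 40000.0 unique terms for a million-token collection
```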
![Page 18: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/18.jpg)
Let’s store the web
• Let’s remove:
– Stop words: N/10 + N/20 + …
– Noise words: ~ N/1000
– UTF => ASCII (1 byte per char)
• To store the web you then need:
– only ~ 4/5 of the terms
– 4/5 x 10^10 x 10^3 x 5 x 1B ~= 40TB
• How to search this vast amount of data?
![Page 19: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/19.jpg)
Overview
• Brute force implementation
• Text analysis
• Indexing
• Index coding and query processing
• Web search engines
• Wrap-up
![Page 20: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/20.jpg)
Indexing
• How would you index the web?
• Document index
• Inverted index
• Postings
• Statistical information
• Evaluating a query
• Can we really search the web index?
• Bitmaps and signature files
![Page 21: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/21.jpg)
Example
| Document number | Text |
|---|---|
| 1 | Pease porridge hot, pease porridge cold |
| 2 | Pease porridge in the pot |
| 3 | Nine days old |
| 4 | Some like it hot, some like it cold |
| 5 | Some like it in the pot |
| 6 | Nine days old |

Stop words: in, the, it.
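A minimal sketch of building an inverted index (with term frequencies) over these six documents, filtering the listed stop words:

```python
from collections import defaultdict

DOCS = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot",
    3: "Nine days old",
    4: "Some like it hot, some like it cold",
    5: "Some like it in the pot",
    6: "Nine days old",
}
STOP_WORDS = {"in", "the", "it"}

def build_index(docs):
    """Map each term to a postings list of (doc_id, term_frequency)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().replace(",", " ").split():
            if token not in STOP_WORDS:
                index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}

index = build_index(DOCS)
print(index["porridge"])  # [(1, 2), (2, 1)]
print(index["some"])      # [(4, 2), (5, 1)]
```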
![Page 22: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/22.jpg)
Document index

| Doc. id | cold | days | hot | like | nine | old | pease | porridge | pot | some |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | | 1 | | | | 2 | 2 | | |
| 2 | | | | | | | 1 | 1 | 1 | |
| 3 | | 1 | | | 1 | 1 | | | | |
| 4 | 1 | | 1 | 2 | | | | | | 2 |
| 5 | | | | 1 | | | | | 1 | 1 |
| 6 | | 1 | | | 1 | 1 | | | | |

Field widths: doc. id 5B, each term count 1B.

#docs x ⌈log2 #docs⌉ + #u_terms x #docs x 8b + #u_terms x (5 x 8b + ⌈log2 #u_terms⌉)

10^10 x 5B + 10^6 x 10^10 x 1B + 10^6 x (5 x 1B + 4B) ~= 10PB
![Page 23: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/23.jpg)
Inverted index (1)

| term | doc. id |
|---|---|
| cold | 1 |
| hot | 1 |
| pease | 1 |
| pease | 1 |
| porridge | 1 |
| porridge | 1 |
| pease | 2 |
| porridge | 2 |
| pot | 2 |
| days | 3 |
| nine | 3 |
| old | 3 |
| cold | 4 |
| hot | 4 |
| like | 4 |
| like | 4 |
| some | 4 |
| some | 4 |
| like | 5 |
| pot | 5 |
| some | 5 |
| days | 6 |
| nine | 6 |
| old | 6 |

Field widths: term 4B, doc. id 1B.

10^13 x (4B + 5B) + 10^6 x (5 x 1B + 4B) = 90TB
![Page 24: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/24.jpg)
Inverted index (2)

| term | tf | doc. id |
|---|---|---|
| cold | 1 | 1 |
| hot | 1 | 1 |
| pease | 2 | 1 |
| porridge | 2 | 1 |
| pease | 1 | 2 |
| porridge | 1 | 2 |
| pot | 1 | 2 |
| days | 1 | 3 |
| nine | 1 | 3 |
| old | 1 | 3 |
| cold | 1 | 4 |
| hot | 1 | 4 |
| like | 2 | 4 |
| some | 2 | 4 |
| like | 1 | 5 |
| pot | 1 | 5 |
| some | 1 | 5 |
| days | 1 | 6 |
| nine | 1 | 6 |
| old | 1 | 6 |

Field widths: term 4B, tf 1B, doc. id 5B.

500 x 10^10 x (4B + 1B + 5B) + 10^6 x (5 x 1B + 4B) = 50TB
![Page 25: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/25.jpg)
Inverted index - Postings

Lexicon:

| term | num. doc | pointer |
|---|---|---|
| cold | 2 | -> |
| days | 2 | -> |
| hot | 2 | -> |
| like | 2 | -> |
| nine | 2 | -> |
| old | 2 | -> |
| pease | 2 | -> |
| porridge | 2 | -> |
| pot | 2 | -> |
| some | 2 | -> |

Postings:

| doc. num | tf |
|---|---|
| 1 | 1 |
| 4 | 1 |
| 3 | 1 |
| 6 | 1 |
| 1 | 1 |
| 4 | 1 |
| 4 | 2 |
| … | … |
| 4 | 2 |
| 5 | 1 |

Field widths: term 5 x 1B, num. doc 5B, pointer 5B; doc. num 5B, tf 1B.

500 x 10^10 x (5B + 1B) + 10^6 x (5 x 1B + 5B + 5B) = 30TB + 15MB < 40TB
![Page 26: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/26.jpg)
Inverted index - Statistics

Lexicon:

| term | cf | num. doc | pointer |
|---|---|---|---|
| cold | 2 | 2 | -> |
| days | 2 | 2 | -> |
| hot | 2 | 2 | -> |
| like | 3 | 2 | -> |
| nine | 2 | 2 | -> |
| old | 2 | 2 | -> |
| pease | 3 | 2 | -> |
| porridge | 3 | 2 | -> |
| pot | 2 | 2 | -> |
| some | 3 | 2 | -> |

Postings:

| doc. num | tf |
|---|---|
| 1 | 1 |
| 4 | 1 |
| 3 | 1 |
| 6 | 1 |
| 1 | 1 |
| 4 | 1 |
| 4 | 2 |
| … | … |
| 4 | 2 |
| 5 | 1 |

Field widths: term 5 x 1B, cf 2B, num. doc 5B, pointer 5B; doc. num 5B, tf 1B.

500 x 10^10 x (5B + 1B) + 10^6 x (5 x 1B + 5B + 5B + 5B) = 30TB + 20MB
![Page 27: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/27.jpg)
Inverted index querying

Cold AND hot => doc1, doc4; score = 1/6 x 1/2 x 1/6 x 1/2 = 1/144
![Page 28: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/28.jpg)
Break: can we search the web?
• Number of postings (term-document pairs):
– Number of documents: ~10^10
– Average number of unique terms per document (document size ~1000): ~500
• Number of unique terms: ~10^6
• Formula: #docs x avg_tpd x (⌈log2 #docs⌉ + ⌈log2 max(tf)⌉)
– + #u_trm x (5 x ⌈log2 #char_size⌉ + ⌈log2 N/10⌉ + ⌈log2 #docs/10⌉ + ⌈log2(#docs x avg_tpd)⌉)
• 500 x 10^10 x (5B + 1B) + 10^6 x (5 x 1B + 5B + 5B + 5B) = 3 x 10^13 + 2 x 10^7 ~= 30TB
• Can we still make the search more efficient?
– Yes, but let’s take a look at other indexing techniques
![Page 29: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/29.jpg)
Bitmaps
• For every term in the dictionary a bitvector is stored
• Each bit represents the presence or absence of the term in a document
• Cold AND pease => 100100 & 110000 = 100000

| term | bitvector |
|---|---|
| cold | 100100 |
| days | 001001 |
| hot | 100100 |
| like | 000110 |
| old | 001001 |
| nine | 001001 |
| pease | 110000 |
| porridge | 110000 |
| pot | 010010 |
| some | 000110 |

Field widths: term 5 x 1B, bitvector 1GB (one bit per document).

10^6 x (1GB + 5 x 1B) = 1PB
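The bitvector AND from the example can be sketched directly, with bit i (from the left) marking presence in document i+1:

```python
# Bitmap conjunction: bitwise AND of the two term bitvectors.
bitvectors = {
    "cold":  0b100100,
    "pease": 0b110000,
}

result = bitvectors["cold"] & bitvectors["pease"]
print(f"{result:06b}")  # 100000 -> only document 1 contains both terms
```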
![Page 30: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/30.jpg)
Signature files
• A text index based on storing a signature for each text block to be able to filter out some blocks quickly
• A probabilistic method for indexing text
• k hash functions generate n-bit values
• Signatures of two words can be identical
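A minimal sketch of the idea: k hash functions each set a bit in an n-bit word signature, and a text block's signature is the OR of its word signatures. The hash construction and the N_BITS and K values here are illustrative assumptions, not the slides' parameters:

```python
import hashlib

N_BITS = 16  # signature width (assumption for illustration)
K = 2        # number of hash functions (assumption for illustration)

def word_signature(word):
    """OR together K hash-derived bit positions into an n-bit signature."""
    sig = 0
    for i in range(K):
        digest = hashlib.md5(f"{i}:{word}".encode()).digest()
        bit = int.from_bytes(digest[:4], "big") % N_BITS
        sig |= 1 << bit
    return sig

def block_signature(words):
    """Superimpose the signatures of all words in a text block."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

block = block_signature(["pease", "porridge", "hot", "cold"])
query = word_signature("pease")
# If all query bits are set in the block signature, the block *may* contain
# the word; a miss is definite, but a hit can be a false match to verify.
print(query & block == query)  # True
```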
![Page 31: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/31.jpg)
Signature file example
| term | hash string |
|---|---|
| cold | 1000 0000 0010 0100 |
| days | 0010 0100 0000 1000 |
| hot | 0000 1010 0000 0000 |
| like | 0100 0010 0000 0001 |
| nine | 0010 1000 0000 0100 |
| old | 1000 1000 0100 0000 |
| pease | 0100 0100 0010 0000 |
| porridge | 0100 0100 0010 0000 |
| pot | 0000 0010 0110 0000 |
| some | 0100 0100 0000 0001 |

| nr. | signature | Text |
|---|---|---|
| 1 | 1100 1111 0010 0101 | Pease porridge hot, pease porridge cold |
| 2 | 1110 1111 0110 0001 | Pease porridge in the pot |
| 3 | 1010 1100 0100 1100 | Nine days old |
| 4 | 1100 1110 1010 0111 | Some like it hot, some like it cold |
| 5 | 1110 1111 1110 0011 | Some like it in the pot |
| 6 | 1010 1100 0100 1100 | Nine days old |
![Page 32: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/32.jpg)
Signature file searching
• If the corresponding word signature bits are set in the document signature, there is a high probability that the document contains the word
• Cold (1 & 4) => OK
• Old (2, 3, 5 & 6) => not OK (2 & 5) => fetch the document at query time and check if the word occurs
• Cold AND hot: 1000 1010 0010 0100 (1 & 4) => OK
• Reduce the false hits by increasing the number of bits per term signature

10^10 x (5B + 1KB) = 10PB
![Page 33: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/33.jpg)
Indexing - Recap
• Inverted files
– require less storage than the other two
– more robust for ranked retrieval
– can be extended for phrase/proximity search
– numerous techniques exist for speed & storage space reduction
• Bitmaps
– an order of magnitude more storage than inverted files
– efficient for Boolean queries
![Page 34: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/34.jpg)
Indexing – Recap 2
• Signature files
– an order (or two) of magnitude more storage than inverted files
– require unnecessary access to the main text because of false matches
– no in-memory lexicon
– insertions can be handled easily
• Coded (compressed) inverted files are the state-of-the-art index structure used by most search engines
![Page 35: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/35.jpg)
Overview
• Brute force implementation
• Text analysis
• Indexing
• Index coding and query processing
• Web search engines
• Wrap-up
![Page 36: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/36.jpg)
Inverted file coding
• The inverted file entries are usually stored in order of increasing document number
– <retrieval; 7; [2, 23, 81, 98, 121, 126, 180]>
(the term “retrieval” occurs in 7 documents with document identifiers 2, 23, 81, 98, etc.)
![Page 37: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/37.jpg)
Query processing (1)
• Each inverted file entry is an ascending sequence of integers
– allows merging (joining) of two lists in time linear in the size of the lists
– Advanced Database Applications (211090): a merge join
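The merge join can be sketched as a linear-time intersection of two sorted posting lists, using the deck's information/retrieval example:

```python
def intersect(a, b):
    """Merge-join two ascending posting lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

retrieval = [2, 23, 81, 98, 121, 126, 139]
information = [1, 14, 23, 45, 46, 84, 98, 111, 120]
print(intersect(information, retrieval))  # [23, 98]
```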
![Page 38: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/38.jpg)
Query processing (2)
• Usually queries are assumed to be conjunctive queries
– query: information retrieval
– is processed as information AND retrieval
<retrieval; 7; [2, 23, 81, 98, 121, 126, 139]>

<information; 9; [1, 14, 23, 45, 46, 84, 98, 111, 120]>

– intersection of posting lists gives: [23, 98]
![Page 39: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/39.jpg)
Query processing (3)
• Remember the Boolean model?
– intersection, union and complement are done on posting lists

– so, information OR retrieval

<retrieval; 7; [2, 23, 81, 98, 121, 126, 139]>

<information; 9; [1, 14, 23, 45, 46, 84, 98, 111, 120]>

– union of posting lists gives: [1, 2, 14, 23, 45, 46, 81, 84, 98, 111, 120, 121, 126, 139]
![Page 40: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/40.jpg)
Query processing (4)
• Estimate of selectivity of terms:
– Suppose information occurs on 1 billion pages
– Suppose retrieval occurs on 10 million pages
• Size of postings (5 bytes per doc id):
– 1 billion x 5B = 5GB for information
– 10 million x 5B = 50MB for retrieval
• Hard disk transfer time:
– 50 sec. for information + 0.5 sec. for retrieval
– (ignoring CPU time and disk latency)
![Page 41: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/41.jpg)
Query processing (5)
• We just brought query processing down from 10 days to just 50.5 seconds (!)
:-)
• Still... way too slow...
:-(
![Page 42: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/42.jpg)
Inverted file compression (1)
• Trick 1: store the sequence of doc ids
– <retrieval; 7; [2, 23, 81, 98, 121, 126, 180]>
as a sequence of gaps
– <retrieval; 7; [2, 21, 58, 17, 23, 5, 54]>
• No information is lost
• Posting lists are always processed from the beginning, so the gaps are easily decoded into the original sequence
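Trick 1 can be sketched as a pair of encode/decode helpers over the example posting list:

```python
def to_gaps(doc_ids):
    """Store the first doc id, then each id as the difference from its predecessor."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Decode by running sum from the beginning of the list."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

ids = [2, 23, 81, 98, 121, 126, 180]
print(to_gaps(ids))                    # [2, 21, 58, 17, 23, 5, 54]
print(from_gaps(to_gaps(ids)) == ids)  # True
```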
![Page 43: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/43.jpg)
Inverted file compression (2)
• Does it help?
– the maximum gap is determined by the number of indexed web pages…
– infrequent terms are coded as a few large gaps
– frequent terms are coded by many small gaps
• Trick 2: use variable byte length encoding
![Page 44: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/44.jpg)
Variable byte encoding (1)

• Represent a number x as:
– first bits: the unary code for 1 + ⌊log2 x⌋
– remainder bits: the binary code for x − 2^⌊log2 x⌋ in ⌊log2 x⌋ bits
– the unary part specifies how many bits are required to code the remainder part
• For example, x = 5 (⌊log2 5⌋ = 2, since log2 5 ≈ 2.32):
– first bits: 110 (unary code for 1 + 2 = 3)
– remainder: 01 (5 − 2^2 = 1, coded in 2 bits)
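The scheme described here is the Elias γ code; a minimal encoder sketch (using `bit_length` to compute ⌊log2 x⌋ exactly):

```python
def gamma_encode(x):
    """Elias gamma code: unary code for 1 + floor(log2 x), followed by the
    remainder x - 2^floor(log2 x) in floor(log2 x) binary bits."""
    if x < 1:
        raise ValueError("gamma code is defined for integers >= 1")
    n = x.bit_length() - 1               # floor(log2 x), computed exactly
    unary = "1" * n + "0"                # unary code for n + 1
    remainder = format(x - (1 << n), f"0{n}b") if n else ""
    return unary + remainder

print(gamma_encode(5))  # 110 + 01 -> '11001'
print(gamma_encode(1))  # '0'
```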
![Page 45: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/45.jpg)
Variable byte encoding (2)
![Page 46: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/46.jpg)
Index sizes
![Page 47: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/47.jpg)
Index size of “our Google”
• Number of postings (term-document pairs):
– 10 billion documents
– 500 unique terms per document on average
– Assume on average 6 bits per doc id
• 500 x 10^10 x 6 bits ~= 4TB
– about 15% of the uncompressed inverted file
![Page 48: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/48.jpg)
Query processing on compressed index
• Size of postings (6 bits per doc id):
– 1 billion x 6 bits = 750MB for information
– 10 million x 6 bits = 7.5MB for retrieval
• Hard disk transfer time:
– 7.5 sec. for information + 0.08 sec. for retrieval
– (ignoring CPU time, disk latency, and decompression time)
![Page 49: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/49.jpg)
Query processing – Continued (1)
• We just brought query processing down from 10 days to just 50.5 seconds...
• and brought that down to 7.58 seconds
:-)
• but that is still too slow...
:-(
![Page 50: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/50.jpg)
Early termination (1)
• Suppose we re-sort the document ids for each posting list such that the best documents come first
– e.g., sort document identifiers for retrieval by their tf.idf values:
– <retrieval; 7; [98, 23, 180, 81, 121, 2, 126]>
– then the top 10 documents for retrieval can be retrieved very quickly: stop after processing the first 10 document ids from the posting list!
– but compression and merging (multi-word queries) of postings are no longer possible…
Early termination (2)
• Trick 3: define a static (or global) ranking of all documents
– such as Google PageRank (!)
– re-assign document identifiers so that the highest-ranked documents get the lowest ids
– for every term, documents with a high PageRank are then in the initial part of the posting list
– estimate the selectivity of the query and only process part of the posting files
![Page 52: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/52.jpg)
Early termination (3)
• Probability that a document contains a term:
– 1 billion / 10 billion = 0.1 for information
– 10 million / 10 billion = 0.001 for retrieval
• Assume independence between terms:
– 0.1 x 0.001 = 0.0001 of the documents contain both terms
– so, on average 1 in 1 / 0.0001 = 10,000 documents contains information AND retrieval
– for the top 30, process 300,000 documents
– 300,000 / 10 billion = 0.00003 of the posting files
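The selectivity chain can be recomputed directly (note that 10 million / 10 billion is 0.001):

```python
# Recomputing the early-termination selectivity estimate.
docs = 10**10
p_information = 10**9 / docs            # 0.1
p_retrieval = 10**7 / docs              # 0.001
p_both = p_information * p_retrieval    # 0.0001, assuming independence
docs_needed_for_top30 = 30 / p_both     # ~300,000 documents to scan
fraction_of_postings = docs_needed_for_top30 / docs  # ~0.00003
print(round(docs_needed_for_top30), fraction_of_postings)
```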
![Page 53: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/53.jpg)
Query processing on compressed index with early termination
• Process about 0.00003 of the postings:
– 0.00003 x 750MB = 22.5KB for information
– 0.00003 x 7.5MB = 225 bytes for retrieval
• Hard disk transfer time:
– 0.2 msec. for information + 0.002 msec. for retrieval
– (NB: ignoring CPU time, disk latency, and decompression time is no longer reasonable now)
![Page 54: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/54.jpg)
Query processing – Continued (2)
• We just brought query processing down from 10 days to less than 1 ms. !
:-)
“This engine is incredibly, amazingly, ridiculously fast!”
(from “Top Gear” every Thursday on BBC2)
![Page 55: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/55.jpg)
Overview
• Brute force implementation
• Text analysis
• Indexing
• Compression
• Web search engines
• Wrap-up
![Page 56: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/56.jpg)
Web page ranking
• Varies by search engine
– Pretty messy in many cases
– Details usually proprietary and fluctuating
• Combining subsets of:
– Term frequencies
– Term proximities
– Term position (title, top of page, etc.)
– Term characteristics (boldface, capitalized, etc.)
– Link analysis information
– Category information
– Popularity information
![Page 57: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/57.jpg)
What about Google
• Google maintains the world’s largest Linux cluster (10,000 servers)
• These are partitioned between index servers and page servers
– Index servers resolve the queries (massively parallel processing)
– Page servers deliver the results of the queries
• Over 8 billion web pages are indexed and served by Google
![Page 58: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/58.jpg)
Google: Architecture
(Brin & Page 1997)
![Page 59: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/59.jpg)
Google: Zlib compression
• A variant of LZ77 (gzip)
![Page 60: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/60.jpg)
Google: Forward & Inverted Index
![Page 61: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/61.jpg)
Google: Query evaluation
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
![Page 62: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/62.jpg)
Google: Storage numbers
| Storage item | Size |
|---|---|
| Total Size of Fetched Pages | 147.8 GB |
| Compressed Repository | 53.5 GB |
| Short Inverted Index | 4.1 GB |
| Full Inverted Index | 37.2 GB |
| Lexicon | 293 MB |
| Temporary Anchor Data (not in total) | 6.6 GB |
| Document Index Incl. Variable Width Data | 9.7 GB |
| Links Database | 3.9 GB |
| Total Without Repository | 55.2 GB |
| Total With Repository | 108.7 GB |
![Page 63: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/63.jpg)
Google: Page search
Web Page Statistics

| Statistic | Count |
|---|---|
| Number of Web Pages Fetched | 24 million |
| Number of URLs Seen | 76.5 million |
| Number of Email Addresses | 1.7 million |
| Number of 404's | 1.6 million |
![Page 64: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/64.jpg)
Google: Search speed
| Query | Initial: CPU Time (s) | Initial: Total Time (s) | Repeated (IO mostly cached): CPU Time (s) | Repeated: Total Time (s) |
|---|---|---|---|---|
| al gore | 0.09 | 2.13 | 0.06 | 0.06 |
| vice president | 1.77 | 3.84 | 1.66 | 1.80 |
| hard disks | 0.25 | 4.86 | 0.20 | 0.24 |
| search engines | 1.31 | 9.63 | 1.16 | 1.16 |
![Page 65: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/65.jpg)
Q’s
• What about web search today?
• How many pages?
• How many searches per second?
• Who is the best?
![Page 66: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/66.jpg)
Web search November 2004
| Search Engine | Reported Size |
|---|---|
| Google | 8.1 billion |
| MSN | 5.0 billion |
| Yahoo | 4.2 billion (estimate) |
| Ask Jeeves | 2.5 billion |
http://blog.searchenginewatch.com/blog/041111-084221
![Page 67: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/67.jpg)
Web search February 2003
| Service | Searches per day |
|---|---|
| Google | 250 million |
| Overture (Yahoo) | 167 million |
| Inktomi (Yahoo) | 80 million |
| LookSmart (MSN) | 45 million |
| FindWhat | 33 million |
| Ask Jeeves | 20 million |
| AltaVista | 18 million |
| FAST | 12 million |
http://searchenginewatch.com/reports/article.php/2156461
![Page 68: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/68.jpg)
Web search July 2005 (US)
http://searchenginewatch.com/reports/
![Page 69: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/69.jpg)
Overview
• Brute force implementation
• Text analysis
• Indexing
• Compression
• Web search engines
• Wrap-up
![Page 70: Information Retrieval Implementation issues](https://reader036.vdocuments.net/reader036/viewer/2022062321/56813c26550346895da59eff/html5/thumbnails/70.jpg)
Summary
• Term distribution and statistics– What is useful and what is not
• Indexing techniques (inverted files)– How to store the web
• Compression, coding, and querying– How to squeeze the web for efficient search
• Search engines– Google: first steps and now