Information Retrieval
Lecture 1: Introduction to Information Retrieval (Manning et al. 2007)
Chapter 1
For the MSc Computer Science Programme
Dell Zhang, Birkbeck, University of London
What is IR?
IR is about search engines.
Search Engine: Unstructured Data (Text), Keyword Queries
VS
Database System: Structured Data (Tables), SQL Queries
Structured data has been the big commercial success (e.g., Oracle) but unstructured data is now becoming dominant in a large and increasing range of activities.
Search is Cool
Text everywhere
  books, documents (ms-word, pdf, etc.), articles (journal, magazine, newspaper, etc.), Web pages, emails, SMS, chat, …
Much more than text
  music, photos, videos, …
Much more than the Web
  personal, enterprise, institutional and domain-specific
Search is Cool
Not only search/finding, but also organization and mining
  clustering
  classification
  ……
For example, given thousands of CVs, …
http://commons.wikimedia.org/wiki/Image:Bookshelf.jpg
Search is Hot
Most people’s means of information access
  In the 1990s: other people
  Nowadays: Web search
“92% of Internet users say the Internet is a good place to go for getting everyday information” – (2004 Pew Internet Survey)
The Search Engine War: Google, Yahoo, Microsoft, …
Video: How Big Can You Think?
Data: Unstructured vs. Structured (in 1996)
[Bar chart: Data volume and Market Cap, Unstructured vs. Structured; vertical axis 0 to 160]
Data: Unstructured vs. Structured (in 2006)
[Bar chart: Data volume and Market Cap, Unstructured vs. Structured; vertical axis 0 to 160]
Free Textbook Online
C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html
Head of Yahoo! Research
Don’t remember. Search!
Another reason for you to come to this module.
A Simple Search Engine
http://www.rhymezone.com/shakespeare/
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?
  Slow (for large corpora)
  NOT Calpurnia is non-trivial
  Other operations (e.g., find the word Romans near countrymen) not feasible
“to be, or not to be”
How do you search in a book?
Boolean Queries
Boolean Queries are queries using AND, OR and NOT together with query terms.
  Each document is viewed as a set of words.
  Precise: a document either matches the condition or it does not.
Primary commercial retrieval tool for 3 decades.
Professional searchers still like Boolean queries: you know exactly what you’re getting.
  For example, http://www.westlaw.com/ .
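A minimal sketch of the "document as a set of words" view, in Python. The helper name `words` and the two query functions are made up for illustration, and the two sample documents are short passages from the plays. Each query is a plain yes/no test on a set.

```python
import re

def words(text):
    """View a document as a set of lowercased words (order and counts are dropped)."""
    return set(re.findall(r"[a-z']+", text.lower()))

doc1 = words("I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.")
doc2 = words("So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious")

def brutus_and_caesar_not_calpurnia(doc):
    # Brutus AND Caesar AND NOT Calpurnia
    return "brutus" in doc and "caesar" in doc and "calpurnia" not in doc

def brutus_and_not_capitol(doc):
    # Brutus AND NOT Capitol
    return "brutus" in doc and "capitol" not in doc

results = {
    "doc1": (brutus_and_caesar_not_calpurnia(doc1), brutus_and_not_capitol(doc1)),
    "doc2": (brutus_and_caesar_not_calpurnia(doc2), brutus_and_not_capitol(doc2)),
}
print(results)
```

A document either matches or it does not; there is no ranking in this model.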
Term-Document Incidence
1 if play contains word, 0 otherwise

Term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0

Query: Brutus AND Caesar but NOT Calpurnia
Incidence Vectors
So we have a 0/1 vector for each term.
To answer the query, take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND them:
110100 & 110111 & 101111 = 100100
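The bitwise AND above can be checked with a small Python sketch. It assumes the leftmost bit stands for the first play in the table (Antony and Cleopatra) and the rightmost for Macbeth; the variable names are illustrative.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One 0/1 incidence vector per term, written as a 6-bit integer.
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

mask = (1 << len(plays)) - 1                  # 0b111111, keeps the complement to 6 bits
not_calpurnia = ~incidence["Calpurnia"] & mask

result = incidence["Brutus"] & incidence["Caesar"] & not_calpurnia
bits = format(result, "06b")                  # back to a 6-character 0/1 string
hits = [play for play, bit in zip(plays, bits) if bit == "1"]

print(bits, hits)
```

The mask is needed because Python integers are unbounded, so a bare `~` would give a negative number rather than a 6-bit complement.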
Query Results
Antony and Cleopatra, Act III, Scene ii
……
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
……

Hamlet, Act III, Scene ii
……
Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
……
Bigger Corpora
Consider n = 1M documents, each with about 1K terms.
Avg 6 bytes/term incl. spaces/punctuation
  6GB of data in the documents.
Say there are m = 500K distinct terms among these.
Bigger Corpora
The 500K x 1M matrix has half-a-trillion 0’s and 1’s, but no more than one billion 1’s.
  Why? Each of the 1M documents has about 1K terms, so it contributes at most 1K 1’s.
  The matrix is extremely sparse.
Can’t build the matrix straightforwardly. What’s a better representation?
  Only record the 1 positions.
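The sizes quoted for the bigger corpus follow from simple arithmetic. A short Python check, using only the figures from the slides (the variable names are illustrative):

```python
n_docs = 1_000_000        # n = 1M documents
terms_per_doc = 1_000     # about 1K terms each
bytes_per_term = 6        # average, including spaces/punctuation
n_terms = 500_000         # m = 500K distinct terms

corpus_bytes = n_docs * terms_per_doc * bytes_per_term   # size of the raw text
matrix_cells = n_terms * n_docs                          # cells in the incidence matrix
max_ones = n_docs * terms_per_doc                        # each document sets at most 1K cells
density = max_ones / matrix_cells                        # upper bound on the fraction of 1's

print(corpus_bytes, matrix_cells, max_ones, density)
```

So at most 0.2% of the cells can be 1, which is why recording only the 1 positions pays off.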
Inverted Index
For each term T, we must store a list of all documents that contain T.
Do we use an array or a list for this?
Brutus      2 4 8 16 32 64 128
Calpurnia   13 16
Caesar      1 2 3 5 8 13 21 34
What happens if the word Caesar is added to document 14?
Inverted Index
Linked lists generally preferred to arrays
  Dynamic space allocation
  Insertion of terms into documents easy
  But: space overhead of pointers

Dictionary  Postings (sorted by docID)
Brutus      2 4 8 16 32 64 128
Calpurnia   13 16
Caesar      1 2 3 5 8 13 21 34
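A rough sketch of the dictionary-plus-postings structure, and of what adding Caesar to document 14 involves. Note the assumption: this uses Python lists (dynamic arrays) with `bisect` rather than linked lists, so it illustrates the sorted-insert requirement, not the linked-list trade-off itself. The function name `add_posting` is made up.

```python
import bisect

# Dictionary: term -> postings list of docIDs, kept sorted.
index = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [13, 16],
    "Caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
}

def add_posting(index, term, doc_id):
    """Insert doc_id into the term's postings, keeping them sorted and duplicate-free."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)       # binary search for the insert position
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)                 # shifts later entries in an array

add_posting(index, "Caesar", 14)
print(index["Caesar"])
```

With an array the insert has to shift every later entry; with a linked list it is a pointer update once the position is found.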
Inverted Index - Construction
Documents (to be indexed): Friends, Romans, countrymen.
  ↓ Tokenization
Token Stream: Friends Romans Countrymen
  ↓ Linguistic Preprocessing
Modified Tokens: friend roman countryman
  ↓ Indexing
Inverted Index:
  friend      2 4
  roman       1 2
  countryman  13 16
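The first two stages of this pipeline can be sketched in a few lines of Python. The regular expression and the `NORMAL_FORMS` lookup table are assumptions made for this one sentence; real linguistic preprocessing is rule-based or dictionary-based and far more general.

```python
import re

# Hypothetical normalization table, just enough for the example sentence.
NORMAL_FORMS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}

def tokenize(text):
    """Split text into tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z']+", text)

def normalize(token):
    """Lowercase the token, then map it to its normal form if one is known."""
    lowered = token.lower()
    return NORMAL_FORMS.get(lowered, lowered)

document = "Friends, Romans, countrymen."
token_stream = tokenize(document)
modified_tokens = [normalize(t) for t in token_stream]

print(token_stream)
print(modified_tokens)
```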
Linguistic Preprocessing
Case-folding
  Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization.
Removal of stopwords
  Very common words like the, of, to, etc.
  List of stopwords (stop lists)
    http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
    http://www.ranks.nl/tools/stopwords.html
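Case-folding and stopword removal might look like this in Python. The three-word stop list is taken from the examples on the slide and is only a stand-in; real stop lists are much longer.

```python
import re

STOP_WORDS = {"the", "of", "to"}   # stand-in stop list, assumed for illustration

def preprocess(text):
    """Lowercase everything, tokenize, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

terms = preprocess("The noble Brutus hath told you Caesar was ambitious")
hamlet = preprocess("to be, or not to be")

print(terms)
print(hamlet)
```

The second call shows the cost of stopword removal: a phrase made mostly of very common words loses part of its content.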
Linguistic Preprocessing
Stemming
  Reduce terms to their “roots” before indexing, e.g., automate(s), automatic, automation all reduced to automat.
Porter’s stemmer
  Implementations in C, Java, Perl, Python, etc.
  http://www.tartarus.org/~martin/PorterStemmer/
……
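To make the idea of suffix stripping concrete, here is a toy stemmer in Python. It is NOT Porter's algorithm: the suffix list and the minimum stem length are invented so that the slide's example works, and it will mis-stem many ordinary words.

```python
# Toy suffix list, longest first. An assumption for illustration, not Porter's rules.
SUFFIXES = ("ion", "es", "ic", "e", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = {w: toy_stem(w) for w in ["automate", "automates", "automatic", "automation"]}
print(stems)
```

All four forms collapse to the same index term, so a query for any one of them finds documents containing the others.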
Input: a sequence of (term, docID) pairs

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
i'         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2
Indexing
Sort by terms.
Term       Doc #
ambitious  2
be         2
brutus     1
brutus     2
caesar     1
caesar     2
caesar     2
capitol    1
did        1
enact      1
hath       2
I          1
I          1
i'         1
it         2
julius     1
killed     1
killed     1
let        2
me         1
noble      2
so         2
the        1
the        2
told       2
was        1
was        2
with       2
you        2
This is the core indexing step.
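Generating the (term, docID) pairs and sorting them can be sketched as follows. The tokenizing regular expression is an assumption, and this sketch lowercases every token (so "I" becomes "i"). Sorting tuples orders by term first and by docID second.

```python
import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def terms(text):
    """Lowercase and tokenize; the apostrophe is kept so that i' stays one token."""
    return re.findall(r"[a-z']+", text.lower())

# One (term, docID) pair per token occurrence, in document order.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in terms(text)]

# The core indexing step: sort by term, then by docID.
sorted_pairs = sorted(pairs)

print(len(pairs))
print(sorted_pairs[:8])
```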
Merge multiple term entries in a single document. Add frequency information.

Term       Doc #  Freq
ambitious  2      1
be         2      1
brutus     1      1
brutus     2      1
caesar     1      1
caesar     2      2
capitol    1      1
did        1      1
enact      1      1
hath       2      1
I          1      2
i'         1      1
it         2      1
julius     1      1
killed     1      2
let        2      1
me         1      1
noble      2      1
so         2      1
the        1      1
the        2      1
told       2      1
was        1      1
was        2      1
with       2      1
you        2      1
Why? Will discuss later.
Split the result into a dictionary file and a postings file.
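Dictionary file                Postings file
Term       N docs  Tot Freq    (Doc #, Freq)
ambitious  1       1           (2, 1)
be         1       1           (2, 1)
brutus     2       2           (1, 1) (2, 1)
caesar     2       3           (1, 1) (2, 2)
capitol    1       1           (1, 1)
did        1       1           (1, 1)
enact      1       1           (1, 1)
hath       1       1           (2, 1)
I          1       2           (1, 2)
i'         1       1           (1, 1)
it         1       1           (2, 1)
julius     1       1           (1, 1)
killed     1       2           (1, 2)
let        1       1           (2, 1)
me         1       1           (1, 1)
noble      1       1           (2, 1)
so         1       1           (2, 1)
the        2       2           (1, 1) (2, 1)
told       1       1           (2, 1)
was        2       2           (1, 1) (2, 1)
with       1       1           (2, 1)
you        1       1           (2, 1)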
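The merge-and-split steps can be sketched in Python as below. `Counter` collapses repeated (term, docID) pairs into frequencies, and the result is then split into two in-memory dicts standing in for the dictionary file and the postings file. The tokenizer and the dict layout are assumptions for illustration.

```python
import re
from collections import Counter

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Merge duplicate (term, docID) pairs and count them: (term, docID) -> frequency.
freq = Counter(
    (term, doc_id)
    for doc_id, text in docs.items()
    for term in re.findall(r"[a-z']+", text.lower())
)

# Postings: term -> [(docID, freq), ...], sorted by docID.
postings = {}
for (term, doc_id), f in sorted(freq.items()):
    postings.setdefault(term, []).append((doc_id, f))

# Dictionary: term -> (number of documents, total frequency).
dictionary = {
    term: (len(plist), sum(f for _, f in plist))
    for term, plist in postings.items()
}

print(dictionary["caesar"], postings["caesar"])
```

In a real index the dictionary entry would also hold a pointer to where the term's postings start in the postings file.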
Where do we pay in storage?
The index we just built
How do we process a query?
Query Processing
Consider processing the query: Brutus AND Caesar
  Locate Brutus in the dictionary; retrieve its postings.
  Locate Caesar in the dictionary; retrieve its postings.
  “Merge” the two postings:

Brutus      2 4 8 16 32 64 128
Caesar      1 2 3 5 8 13 21 34
Query Processing - Merge
Brutus      2 4 8 16 32 64 128
Caesar      1 2 3 5 8 13 21 34
Result      2 8
What’s crucial: postings must be sorted by docID.
Walk through the two postings simultaneously, in time linear in the total number of postings entries. In other words, if the posting list lengths are x and y, the merge takes O(x+y) operations.
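A minimal Python sketch of this merge for an AND query. Two indices walk the sorted postings lists in step, and each comparison advances at least one of them, which is where the O(x+y) bound comes from. The function name `intersect` is a choice made here.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:        # docID in both lists: part of the answer
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:       # advance whichever list has the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))
```

If the lists were not sorted, each docID of one list would have to be searched for in the other.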
Query Processing - Exercise
Adapt the merge algorithm for the queries:
  Brutus AND NOT Caesar
  Brutus OR NOT Caesar
Can we still run through the merge in time O(x+y)?
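One possible adaptation for the first query, as a sketch rather than a model answer: walk both lists as before, but keep the docIDs that appear only in the first list. The name `and_not` is made up. The second query is left aside here because its answer depends on the set of all docIDs in the collection, not just on the two lists.

```python
def and_not(p1, p2):
    """DocIDs in p1 but not in p2; both lists sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])  # p1[i] cannot occur in the rest of p2
            i += 1
        elif p1[i] == p2[j]:      # present in both: excluded
            i += 1
            j += 1
        else:                     # p2 is behind: catch up
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))
```

Each step advances at least one index, so the walk still takes at most x+y steps.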
Query Processing - Exercise
What about an arbitrary Boolean formula?
  (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we always merge in “linear” time? The time complexity is linear in what?
Can we do better?
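One way to explore this formula is a single generic merge that takes a rule saying which docIDs to keep. The sketch below assumes the six plays of the incidence table are numbered 1 to 6 in column order, and the names `merge`, `OR` and `AND_NOT` are choices made here. It evaluates the formula from the inside out.

```python
def merge(a, b, keep):
    """Walk two sorted postings lists once; keep(in_a, in_b) decides what is output."""
    out = []
    i = j = 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            doc, in_a, in_b = a[i], True, False      # docID only in a
            i += 1
        elif i == len(a) or b[j] < a[i]:
            doc, in_a, in_b = b[j], False, True      # docID only in b
            j += 1
        else:
            doc, in_a, in_b = a[i], True, True       # docID in both
            i += 1
            j += 1
        if keep(in_a, in_b):
            out.append(doc)
    return out

OR = lambda in_a, in_b: in_a or in_b
AND_NOT = lambda in_a, in_b: in_a and not in_b

# Postings read off the incidence table (docIDs 1..6 are an assumed numbering).
antony, brutus = [1, 2, 6], [1, 2, 4]
caesar, cleopatra = [1, 2, 4, 5, 6], [1]

left = merge(brutus, caesar, OR)
right = merge(antony, cleopatra, OR)
result = merge(left, right, AND_NOT)
print(left, right, result)
```

Each merge is linear in its two inputs, but the intermediate lists can be longer than any single postings list, which is part of what the question about "linear in what?" is probing. The rule is never called for a docID missing from both lists, so a bare NOT cannot be expressed this way.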
Query Processing - Exercise
Extend the merge algorithm to an arbitrary Boolean query.
Can we always guarantee execution in time linear in the total postings size?
[Hint] Begin with the case of a Boolean formula query, in which each query term appears only once in the query.
Take Home Messages
IR is about search engines. Search is hot! Search is cool!
Free textbook online! http://www.dcs.bbk.ac.uk/~dell/teaching/ir/
Inverted Index: dictionary + postings
Boolean Search: query processing