prasadl3invertedindex1 inverted index construction adapted from lectures by prabhakar raghavan...

28
Prasad L3InvertedIndex 1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Upload: preston-bradford

Post on 22-Dec-2015

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 1

Inverted Index Construction

Adapted from Lectures byPrabhakar Raghavan (Yahoo and

Stanford) and Christopher Manning (Stanford)

Page 2: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 2

Unstructured data in 1650

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

One could grepgrep all of Shakespeare’s plays for Brutus and Caesar, then strip out plays containing Calpurnia? Slow (for large corpora) NOT Calpurnia is non-trivial Other operations (e.g., find the word

Romans near countrymen) not feasible

Page 3: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 3

Term-document incidence

1 if play contains word, 0 otherwise

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 1 1 0 0 0 1

Brutus 1 1 0 1 0 0

Caesar 1 1 0 1 1 1

Calpurnia 0 1 0 0 0 0

Cleopatra 1 0 0 0 0 0

mercy 1 0 1 1 1 1

worser 1 0 1 1 1 0

Brutus AND Caesar but NOT Calpurnia

Page 4: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 4

Incidence vectors

So we have a 0/1 vector for each term.

To answer query: take the vectors for Brutus, Caesar and

Calpurnia (complemented) bitwise AND.

110100 AND 110111 AND 101111 = 100100.

Page 5: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 5

Answers to query

Antony and Cleopatra, Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Page 6: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 6

Bigger corpora

Consider N = 1M documents, each with about 1K terms.

Avg 6 bytes/term including spaces/punctuation

6GB of data in the documents.

Say there are m = 500K distinct terms among these.

Page 7: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 7

Can’t build the matrix

500K x 1M matrix has half-a-trillion 0’s and 1’s.

But it has no more than one billion 1’s. matrix is extremely sparse.

What’s a better representation? We only record the 1 positions.

Why?

Page 8: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 8

Inverted index

For each term T, we must store a list of all documents that contain T.

Do we use an array or a list for this?

Brutus

Calpurnia

Caesar

1 2 3 5 8 13 21 34

2 4 8 16 32 64128

13 16

What happens if the word Caesar is added to document 14?

Page 9: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 9

Inverted index

Linked lists generally preferred to arrays+ Dynamic space allocation+ Insertion of terms into documents easy− Space overhead of pointers

Brutus

Calpurnia

Caesar

2 4 8 16 32 64 128

2 3 5 8 13 21 34

13 16

1

Dictionary Postings lists

Sorted by docID (more later on why).

Posting

Page 10: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Inverted index construction

Tokenizer

Token stream. Friends Romans Countrymen

Linguistic modules

Modified tokens. friend roman countryman

Indexer

Inverted index.

friend

roman

countryman

2 4

2

13 16

1

More onthese later.

Documents tobe indexed.

Friends, Romans, countrymen.

Page 11: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex

Sequence of (Modified token, Document ID) pairs.

I did enact JuliusCaesar I was killed

i' the Capitol; Brutus killed me.

Doc 1

So let it be withCaesar. The noble

Brutus hath told youCaesar was ambitious

Doc 2

Term Doc #I 1did 1enact 1julius 1caesar 1I 1was 1killed 1i' 1the 1capitol 1brutus 1killed 1me 1so 2let 2it 2be 2with 2caesar 2the 2noble 2brutus 2hath 2told 2you 2

caesar 2was 2ambitious 2

Indexer steps

Page 12: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex

Sort by terms. Term Doc #ambitious 2be 2brutus 1brutus 2capitol 1caesar 1caesar 2caesar 2did 1enact 1hath 1I 1I 1i' 1it 2julius 1killed 1killed 1let 2me 1noble 2so 2the 1the 2told 2you 2was 1was 2with 2

Term Doc #I 1did 1enact 1julius 1caesar 1I 1was 1killed 1i' 1the 1capitol 1brutus 1killed 1me 1so 2let 2it 2be 2with 2caesar 2the 2noble 2brutus 2hath 2told 2you 2caesar 2was 2ambitious 2

Core indexing step.

Page 13: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Multiple term entries in a single document are merged.

Frequency information is added.

Term Doc # Term freqambitious 2 1be 2 1brutus 1 1brutus 2 1capitol 1 1caesar 1 1caesar 2 2did 1 1enact 1 1hath 2 1I 1 2i' 1 1it 2 1julius 1 1killed 1 2let 2 1me 1 1noble 2 1so 2 1the 1 1the 2 1told 2 1you 2 1was 1 1was 2 1with 2 1

Term Doc #ambitious 2be 2brutus 1brutus 2capitol 1caesar 1caesar 2caesar 2did 1enact 1hath 1I 1I 1i' 1it 2julius 1killed 1killed 1let 2me 1noble 2so 2the 1the 2told 2you 2was 1was 2with 2

Why frequency?Will discuss later.

Page 14: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

L3InvertedIndex

The result is split into a Dictionary file and a Postings file.

Doc # Freq2 12 11 12 11 11 12 21 11 12 11 21 12 11 11 22 11 12 12 11 12 12 12 11 12 12 1

Term N docs Coll freqambitious 1 1be 1 1brutus 2 2capitol 1 1caesar 2 3did 1 1enact 1 1hath 1 1I 1 2i' 1 1it 1 1julius 1 1killed 1 2let 1 1me 1 1noble 1 1so 1 1the 2 2told 1 1you 1 1was 2 2with 1 1

Term Doc # Freqambitious 2 1be 2 1brutus 1 1brutus 2 1capitol 1 1caesar 1 1caesar 2 2did 1 1enact 1 1hath 2 1I 1 2i' 1 1it 2 1julius 1 1killed 1 2let 2 1me 1 1noble 2 1so 2 1the 1 1the 2 1told 2 1you 2 1was 1 1was 2 1with 2 1

Page 15: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad 15

Where do we pay in storage? Doc # Freq

2 12 11 12 11 11 12 21 11 12 11 21 12 11 11 22 11 12 12 11 12 12 12 11 12 12 1

Term N docs Coll freqambitious 1 1be 1 1brutus 2 2capitol 1 1caesar 2 3did 1 1enact 1 1hath 1 1I 1 2i' 1 1it 1 1julius 1 1killed 1 2let 1 1me 1 1noble 1 1so 1 1the 2 2told 1 1you 1 1was 2 2with 1 1

Pointers

Terms

Will quantify the storage, later.

Page 16: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 16

Query Processing

How?

What?

Page 17: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 17

Query processing: AND

Consider processing the query:Brutus AND Caesar Locate Brutus in the Dictionary;

Retrieve its postings. Locate Caesar in the Dictionary;

Retrieve its postings. “Merge” the two postings:

128

34

2 4 8 16 32 64

1 2 3 5 8 13

21

Brutus

Caesar

Page 18: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 18

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

The merge

Walk through the two postings simultaneously, in time linear in the total number of postings entries

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

Brutus

Caesar2 8

If the list lengths are x and y, the merge takes O(x+y)operations.Crucial: postings sorted by docID.

Page 19: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 19

Boolean queries: Exact match

Boolean Queries are queries using AND, OR and NOT to join query terms

Views each document as a set of words Is precise: document matches condition or not.

Primary commercial retrieval tool for 3 decades.

Professional searchers (e.g., lawyers) still like Boolean queries: You know exactly what you’re getting.

Page 20: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 20

Example: WestLaw http://www.westlaw.com/

Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)

Tens of terabytes of data; 700,000 users Majority of users still use boolean queries Example query:

What is the statute of limitations in cases involving the federal tort claims act?

LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM

/3 = within 3 words, /S = in same sentence

Page 21: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 21

Example: WestLaw http://www.westlaw.com/

Another example query: Requirements for disabled people to be able to access

a workplace disabl! /p access! /s work-site work-place

(employment /3 place Note that SPACE is disjunction, not conjunction! Long, precise queries; proximity operators;

incrementally developed; not like web search Professional searchers often like Boolean search:

Precision, transparency and control But that doesn’t mean they actually work better ...

Page 22: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 22

Query optimization

Consider a query that is an AND of t terms. For each of the t terms, get its postings,

then AND them together. What is the best order for query

processing?

Brutus

Calpurnia

Caesar

1 2 3 5 8 16 21 34

2 4 8 16 32 64128

13 16

Query: Brutus AND Calpurnia AND Caesar

Page 23: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 23

Query optimization example

Process in order of increasing freq: start with smallest set, then keep cutting

further.

Brutus

Calpurnia

Caesar

1 2 3 5 8 13 21 34

2 4 8 16 32 64128

13 16

This is why we keptfreq in dictionary

Execute the query as (Caesar AND Brutus) AND Calpurnia.

Page 24: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 24

More general optimization

e.g., (madding OR crowd) AND (ignoble OR strife)

Get freq’s for all terms. Estimate the size of each OR by the

sum of its freq’s (conservative). Process in increasing order of OR

sizes.

Page 25: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 25

Space Requirements

The space required for the vocabulary is rather small. According to Heaps’ law the vocabulary grows as O(n), where is a constant between 0.4 and 0.6 in practice.

Size of inverted file as a percentage of text (all words, non-stop words)

45%

19%

18%

73%

26%

25%

36%

18%

1.7%

64%

32%

2.4%

35%

26%

0.5%

63%

47%

0.7%

Addressing words

Addressing documents

Addressing 256 blocks

Index Small collection

(1Mb)

Medium collection

(200Mb)

Large collection

(2Gb)

Page 26: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 26

Space Requirements

To reduce space requirements, a technique called block addressing can be used Advantages:

the number of pointers is smaller than positions

all the occurrences of a word inside a single block are collapsed to one reference

Disadvantages: online (dynamic) search over the qualifying

blocks necessary if exact positions are required

Page 27: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 27

What’s ahead in IR?Beyond term search

What about phrases? Stanford University

Proximity: Find Gates NEAR Microsoft. Need index to capture position information

in docs. More later. Zones in documents: Find documents with

(author = Ullman) AND (text contains automata).

Page 28: PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad L3InvertedIndex 28

Other Indexing Techniques

Even though Inverted Files is the method of choice, in the face of phrase and proximity queries, the following approaches were also developed:

Suffix arrays

Signature files