CS246 Basic Information Retrieval

Upload: darren-holland

Post on 15-Jan-2016


TRANSCRIPT

Page 1:

CS246

Basic Information Retrieval

Page 2:

Today’s Topic
- Basic Information Retrieval (IR)
- Bag of words assumption
- Boolean model
- Inverted index
- Vector-space model
- Document-term matrix
- TF-IDF vector and cosine similarity
- Phrase queries
- Spell correction

Page 3:

Information-Retrieval System
- Information source: existing text documents
- Keyword-based / natural-language query
- The system returns the best-matching documents given the query
- Challenge
  - Both queries and data are “fuzzy”: unstructured text and “natural language” queries
  - Which documents are good matches for a query? Computers do not “understand” the documents or the queries
  - Developing a computable “model” is essential to implement this approach

Page 4:

Bag of Words: Major Simplification
- Consider each document as a “bag of words”
  - “bag” vs “set”: ignore word ordering, but keep word counts
- Consider queries as bags of words as well
- A great oversimplification, but it works adequately in many cases
  - “John loves only Jane” vs “Only John loves Jane”
  - The limitation still shows up in current search engines
- Still, how do we match documents and queries?
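As a minimal sketch, the bag-of-words view can be expressed as a word-count multiset; the sentence pair from the slide shows exactly the information it discards:

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace: word order is ignored, counts are kept.
    return Counter(text.lower().split())

d1 = bag_of_words("John loves only Jane")
d2 = bag_of_words("Only John loves Jane")
print(d1 == d2)  # True: the two sentences are indistinguishable as bags of words
```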

Page 5:

Boolean Model
- Return all documents that contain the words in the query
- Simplest model for information retrieval
- No notion of “ranking”: a document is either a match or a non-match
- Q: How to find and return matching documents? Basic algorithm? Useful data structure?
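A Boolean match can be sketched as a set-containment test; this is the naive per-document check (the inverted index makes the same decision efficient):

```python
def boolean_match(query, document):
    # A document matches iff it contains every query word; no ranking.
    return set(query.lower().split()) <= set(document.lower().split())

print(boolean_match("UCLA physics", "the physics department at UCLA"))  # True
print(boolean_match("UCLA physics", "physics at MIT"))                  # False
```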

Page 6:

Inverted Index
- Allows quick lookup of the document ids containing a particular word
- Q: How can we use this to answer the query “UCLA Physics”?

[Figure: a lexicon/dictionary (DIC) listing the words Stanford, UCLA, and MIT, each pointing to its postings list of sorted docids, e.g. PL(Stanford) = 1 2 3 9 16 18, PL(UCLA) = 3 8 10 13 16 20, PL(MIT) = 4 5 8 10 13 19 20 22]
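To answer a multi-word query such as “UCLA Physics”, the Boolean model intersects the postings lists of the query words; since the lists are sorted, a linear merge suffices. A sketch, reusing two docid lists from the figure as hypothetical data:

```python
def intersect(pl1, pl2):
    """Merge-intersect two sorted postings lists of docids."""
    i = j = 0
    out = []
    while i < len(pl1) and j < len(pl2):
        if pl1[i] == pl2[j]:
            out.append(pl1[i])
            i += 1
            j += 1
        elif pl1[i] < pl2[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect([3, 8, 10, 13, 16, 20], [4, 5, 8, 10, 13, 19, 20, 22]))
# [8, 10, 13, 20]
```

The merge runs in time proportional to the sum of the list lengths, which is why postings lists are kept sorted.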

Page 7: (repeats the inverted-index figure from Page 6)

Page 8:

Size of Inverted Index (1)
- 100M docs, 10KB/doc, 1000 unique words/doc, 10B/word, 4B/docid
- Q: Document collection size?
- Q: Inverted index size?
- Heaps’ law: vocabulary size = k·n^b, with 30 < k < 100 and 0.4 < b < 1; k = 50 and b = 0.5 are a good rule of thumb
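The two size questions can be answered back-of-the-envelope from the stated numbers, using Heaps’ law with the rule-of-thumb constants k = 50, b = 0.5:

```python
docs = 100_000_000           # 100M documents
doc_size = 10_000            # 10 KB per document
bytes_per_word = 10
bytes_per_docid = 4
unique_words_per_doc = 1000

collection_size = docs * doc_size                              # 10^12 B = 1 TB
postings_size = docs * unique_words_per_doc * bytes_per_docid  # 4*10^11 B = 400 GB

# Heaps' law vocabulary estimate: k * n^b with k = 50, b = 0.5,
# where n is the total number of tokens in the corpus.
n = docs * (doc_size // bytes_per_word)   # ~10^11 tokens
vocab = 50 * n ** 0.5                     # ~15.8M distinct words
dict_size = vocab * bytes_per_word        # ~158 MB: tiny next to the postings
```

So the index is dominated by the postings lists, not the dictionary, which foreshadows the next slide’s question.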

Page 9:

Size of Inverted Index (2)
- Q: Between the dictionary and the postings lists, which one is larger?
- Q: Lengths of postings lists?
- Zipf’s law: collection term frequency ∝ 1/(frequency rank)
- Q: How do we construct an inverted index?
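Zipf’s law implies that postings-list lengths are heavily skewed; a small numeric sketch (the rank-1 frequency is an assumed constant for illustration):

```python
# Zipf's law sketch: the rank-i term has collection frequency ~ cf1 / i,
# so a handful of top-ranked terms account for most postings entries.
cf1 = 1_000_000                              # assumed frequency of the rank-1 term
cf = [cf1 / rank for rank in range(1, 11)]   # frequencies of the top-10 terms
top3_share = sum(cf[:3]) / sum(cf)
print(round(top3_share, 2))  # 0.63: the top 3 of 10 terms carry ~63% of occurrences
```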

Page 10:

Inverted-Index Construction
C: set of all documents (corpus)
DIC: dictionary of the inverted index
PL(w): postings list of word w

1: For each document d ∈ C:
2:   Extract all words in content(d) into W
3:   For each w ∈ W:
4:     If w ∉ DIC, then add w to DIC
5:     Append id(d) to PL(w)

Q: What if the index is larger than main memory?
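The construction pseudocode translates directly to Python for an in-memory index (the corpus dict and whitespace tokenization below are illustrative assumptions):

```python
def build_inverted_index(corpus):
    """corpus: dict mapping docid -> document text. Returns word -> postings list."""
    index = {}                                 # DIC and postings lists in one dict
    for docid, text in corpus.items():
        words = set(text.lower().split())      # W: distinct words of the document
        for w in words:
            index.setdefault(w, []).append(docid)  # append id(d) to PL(w)
    return index

idx = build_inverted_index({1: "UCLA physics", 2: "MIT physics", 3: "UCLA campus"})
print(sorted(idx["ucla"]))  # [1, 3]
```

Iterating docids in order keeps each postings list sorted, which the merge-intersection relies on.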

Page 11:

Inverted-Index Construction
- For a large text corpus: block-sort-based construction (partition and merge)

Page 12:

Evaluation: Precision and Recall
- Q: Are all matching documents what users want?
- Basic idea: a model is good if it returns a document if and only if it is “relevant”
- R: set of “relevant” documents; D: set of documents returned by the model

  Precision = |D ∩ R| / |D|
  Recall = |D ∩ R| / |R|
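With R and D as sets, the two measures are one line each; the sets below are made-up examples:

```python
def precision_recall(returned, relevant):
    """Precision = |D n R| / |D|, Recall = |D n R| / |R|."""
    D, R = set(returned), set(relevant)
    hit = D & R
    return len(hit) / len(D), len(hit) / len(R)

p, r = precision_recall(returned={1, 2, 3, 4}, relevant={3, 4, 5})
# p = 2/4 = 0.5 (half of what we returned was relevant)
# r = 2/3      (we found two of the three relevant documents)
```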

Page 13:

Vector-Space Model
- Main problem of the Boolean model: too many matching documents when the corpus is large
- Any way to “rank” documents?
- Matrix interpretation of the Boolean model: a document-term matrix with a Boolean 0-or-1 value in each entry
- Basic idea: assign a real-valued weight to each matrix entry depending on the importance of the term (“the” vs “UCLA”)
- Q: How should we assign the weights?

Page 14:

TF-IDF Vector
- A term t is important for document d
  - if t appears many times in d, or
  - if t is a “rare” term
- TF (term frequency): number of occurrences of t in d
- DF (document frequency): number of documents containing t
- TF-IDF weighting: weight = TF × log(N/DF), where N is the total number of documents
- Q: How do we use it to compute query-document relevance?
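A sketch of the weighting with made-up counts; note how a term that occurs in every document gets weight zero no matter how often it appears:

```python
import math

def tfidf_weight(tf, df, n_docs):
    # weight = TF * log(N / DF): frequent in the document AND rare in the corpus
    return tf * math.log(n_docs / df)

# "the" appears in (nearly) every document -> weight 0 regardless of TF
print(tfidf_weight(tf=50, df=1_000_000, n_docs=1_000_000))  # 0.0
# "UCLA" is rare -> a single occurrence already carries a large weight
print(tfidf_weight(tf=1, df=100, n_docs=1_000_000))         # ~9.21
```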

Page 15:

Cosine Similarity
- Represent both the query Q and the document D as TF-IDF vectors
- Take the inner product of the two normalized vectors to compute their similarity:

  sim(Q, D) = (Q · D) / (|Q| |D|)

- Note: |Q| does not matter for document ranking; division by |D| penalizes longer documents.

Page 16:

Cosine Similarity: Example
- idf(UCLA) = 10, idf(good) = 0.1, idf(university) = idf(car) = idf(racing) = 1
- Q = (UCLA, university), D = (car, racing)
- Q = (UCLA, university), D = (UCLA, good)
- Q = (UCLA, university), D = (university, good)
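The three example pairs can be scored directly from the formula, assuming TF = 1 for every listed word (so each weight is just the IDF):

```python
import math

# IDF values from the slide; TF = 1 for each listed word is an assumption.
IDF = {"UCLA": 10, "good": 0.1, "university": 1, "car": 1, "racing": 1}

def vec(words):
    return {w: IDF[w] for w in words}

def cosine(q, d):
    dot = sum(wt * d[w] for w, wt in q.items() if w in d)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(q) * norm(d))

q = vec(["UCLA", "university"])
s1 = cosine(q, vec(["car", "racing"]))       # 0.0: no shared term
s2 = cosine(q, vec(["UCLA", "good"]))        # ~0.995: shares the rare, heavy term
s3 = cosine(q, vec(["university", "good"]))  # ~0.099: shares only a light term
```

The ranking s2 > s3 > s1 matches the intuition: sharing the rare term “UCLA” dominates sharing the common term “university”.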

Page 17:

Finding High Cosine-Similarity Documents
- Q: Under the vector-space model, do precision and recall make sense?
- Q: How do we find the documents with the highest cosine similarity in the corpus?
- Q: Any way to avoid a complete scan of the corpus?

Page 18:

Inverted Index for TF-IDF
- Q · di = 0 if di has no query words, so consider only the documents that contain query words
- Inverted index: word → documents

[Figure: the lexicon stores each word together with its IDF (Stanford: 1/3530, UCLA: 1/9860, MIT: 1/937); each postings-list entry stores a (docid, TF) pair, for docids such as D1, D14, and D376 with TF values such as 2 and 308. TF may be normalized by document size.]
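A common way to use such an index is to accumulate each document’s score while walking only the postings lists of the query words; the index contents below are hypothetical, storing (docid, TF) pairs per word:

```python
import math

N = 1000  # assumed corpus size
# Hypothetical index: word -> (DF, postings), postings = [(docid, TF), ...]
index = {
    "ucla":    (10,  [("D1", 2), ("D14", 1)]),
    "physics": (100, [("D14", 3), ("D376", 1)]),
}

def score(query_words):
    """Accumulate TF*IDF contributions only over postings of the query words."""
    scores = {}
    for w in query_words:
        if w not in index:
            continue
        df, postings = index[w]
        idf = math.log(N / df)
        for docid, tf in postings:
            scores[docid] = scores.get(docid, 0.0) + tf * idf
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score(["ucla", "physics"]))  # D14 ranks first: it matches both words
```

Documents with no query word are never touched, which is exactly the Q · di = 0 observation from the slide.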

Page 19:

Phrase Queries
- Find “Harvard University Boston” exactly as a phrase
- Q: How can we support this query?
- Two approaches: biword index and positional index
- Q: Pros and cons of each approach?
- Rule of thumb: a 2x-4x size increase for a positional index compared to a docid-only index
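A positional index stores, for each word and document, the sorted positions of its occurrences; a phrase matches wherever the positions line up consecutively. A sketch over a small hypothetical index:

```python
# Hypothetical positional index: word -> {docid: sorted positions}
pos_index = {
    "harvard":    {7: [0, 12]},
    "university": {7: [1], 9: [4]},
    "boston":     {7: [2]},
}

def phrase_match(words, index):
    """Docids where the words appear at consecutive positions, in order."""
    docs = set.intersection(*(set(index[w]) for w in words))
    out = []
    for d in docs:
        starts = index[words[0]][d]
        if any(all(p + i in index[w][d] for i, w in enumerate(words))
               for p in starts):
            out.append(d)
    return out

print(phrase_match(["harvard", "university", "boston"], pos_index))  # [7]
```

A biword index would instead store pairs like “harvard university” as dictionary terms, trading dictionary size for simpler lookups.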

Page 20:

Spell Correction
- Q: What is the user’s intention for the query “Britnie Spears”? How can we find the correct spelling?
- Given a user-typed word w, find its correct spelling c
- Probabilistic approach: find the c with the highest probability P(c|w). Q: How do we estimate it?
- Bayes’ rule: P(c|w) = P(w|c)P(c)/P(w)
- Q: What are these probabilities, and how can we estimate them?
- Rule of thumb: 75% of misspellings are within edit distance 1; 98% are within edit distance 2
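The edit-distance rule of thumb suggests generating all candidates within distance 1 and picking the most probable one. The sketch below approximates P(c) by corpus word counts and treats P(w|c) as uniform over the candidates, both simplifying assumptions:

```python
def edits1(word):
    """All strings within one edit (delete, transpose, replace, insert)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(w, counts):
    # P(c) from corpus counts; P(w|c) crudely uniform over distance-1 candidates.
    candidates = (edits1(w) & counts.keys()) or {w}
    return max(candidates, key=lambda c: counts.get(c, 0))

print(correct("spars", {"spears": 120, "spas": 3}))  # spears
```

Covering the 98% case means applying `edits1` twice, at a much larger candidate set.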

Page 21:

Summary
- Boolean model
- Vector-space model: TF-IDF weight, cosine similarity
- Inverted index: for the Boolean model and the TF-IDF model
- Phrase queries
- Spell correction