Introduction to Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 3: Dictionaries and tolerant retrieval



Outline
❶ Recap
❷ Dictionaries
❸ Wildcard queries
❹ Edit distance
❺ Spelling correction
❻ Soundex

Dictionary data structures for inverted indexes
• The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure?

Sec. 3.1

A naïve dictionary
• An array of structs:

  char[20]    int          Postings *
  20 bytes    4/8 bytes    4/8 bytes

• How do we store a dictionary in memory efficiently?
• How do we quickly look up elements at query time?

Sec. 3.1
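A minimal sketch of this layout in Python; the class, field names, and sample terms are illustrative, and a Python list stands in for the fixed-size array of structs:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DictEntry:
    term: str             # the slide's fixed-width char[20] field
    doc_freq: int         # document frequency
    postings: List[int]   # stands in for the pointer to the postings list

# The "array of structs": one entry per vocabulary term.
dictionary = [
    DictEntry("ambitious", 2, [1, 4]),
    DictEntry("brutus", 2, [1, 2]),
    DictEntry("capitol", 1, [1]),
]

def lookup(term: str) -> Optional[DictEntry]:
    # Naive linear scan: O(vocabulary size) per query term,
    # which is exactly why the next slides ask for better structures.
    for entry in dictionary:
        if entry.term == term:
            return entry
    return None

print(lookup("brutus"))
```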

Dictionary data structures
• Two main choices:
  – Hash table
  – Tree
• Some IR systems use hashes, some trees

Sec. 3.1

Hashes
• Each vocabulary term is hashed to an integer
  – (We assume you’ve seen hashtables before)
• Pros:
  – Lookup is faster than for a tree: O(1)
• Cons:
  – No easy way to find minor variants:
    • judgment / judgement
  – No prefix search [tolerant retrieval]
  – If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything

Sec. 3.1
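As a small sketch (terms and data illustrative), a hash-based dictionary in Python is just a dict: exact-match lookup is expected O(1), but a prefix query has nothing better than scanning every key:

```python
# term -> (document frequency, postings list); data is illustrative
hash_dict = {
    "judgment":   (2, [1, 5]),
    "judgement":  (1, [9]),
    "hypothesis": (1, [3]),
}

df, postings = hash_dict["judgment"]          # O(1) expected-time exact lookup

# A prefix query (all terms starting with "hyp") must scan every key,
# because hash order tells us nothing about spelling:
hyp_terms = [t for t in hash_dict if t.startswith("hyp")]
print(hyp_terms)    # ['hypothesis']
```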

Trees
• Simplest: binary tree
• More usual: B-trees
• Trees require a standard ordering of characters and hence strings … but we standardly have one
• Pros:
  – Solves the prefix problem (terms starting with hyp)
• Cons:
  – Slower: O(log M) [and this requires balanced tree]
  – Rebalancing binary trees is expensive
    • But B-trees mitigate the rebalancing problem

Sec. 3.1
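A minimal sketch of the prefix lookup that an ordered dictionary supports, using a sorted Python list with bisect as a stand-in for the tree's ordered traversal (this is the O(log M) search the slide mentions; the vocabulary is illustrative):

```python
import bisect

terms = sorted(["hygiene", "hyphen", "hypothesis", "hypothetical", "zebra"])

def prefix_range(prefix):
    # Every term sharing the prefix lies in [prefix, prefix + "\uffff");
    # "\uffff" works as an upper bound for ordinary vocabulary terms.
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")
    return terms[lo:hi]

print(prefix_range("hyp"))   # ['hyphen', 'hypothesis', 'hypothetical']
```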

Binary tree
[Figure: an example binary tree over the dictionary terms]

Tree: B-tree
– Definition: Every internal node has a number of children in the interval [a, b] where a, b are appropriate natural numbers, e.g., [2, 4].

[Figure: B-tree root whose children cover the term ranges a-hu, hy-m, and n-z]

Sec. 3.1

Outline
❶ Recap
❷ Dictionaries
❸ Wildcard queries
❹ Edit distance
❺ Spelling correction
❻ Soundex

Wild-card queries: *
• mon*: find all docs containing any word beginning with “mon”.
• Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo
• *mon: find words ending in “mon”: harder
  – Maintain an additional B-tree for terms backwards. Can retrieve all words in range: nom ≤ w < non.

Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?

Sec. 3.2
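A sketch of both lookups, using sorted lists as stand-ins for the forward and backward B-trees (helper names and vocabulary are illustrative). The exercise's pro*cent can be answered by intersecting a forward prefix lookup with a backward one:

```python
import bisect

vocab = sorted(["money", "month", "moon", "percent", "prescient",
                "procent", "salmon", "sermon"])
rev_vocab = sorted(w[::-1] for w in vocab)   # the "backwards" B-tree

def with_prefix(words, prefix):
    lo = bisect.bisect_left(words, prefix)
    hi = bisect.bisect_left(words, prefix + "\uffff")
    return words[lo:hi]

mon_star = with_prefix(vocab, "mon")                         # mon*
star_mon = [w[::-1] for w in with_prefix(rev_vocab, "nom")]  # *mon, via reversed terms

# pro*cent: prefix "pro" in the forward tree AND suffix "cent" in the backward tree
pro_x_cent = set(with_prefix(vocab, "pro")) & \
             {w[::-1] for w in with_prefix(rev_vocab, "tnec")}

print(mon_star)            # ['money', 'month']
print(star_mon)            # ['salmon', 'sermon']
print(sorted(pro_x_cent))  # ['procent']
```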

How to handle * in the middle of a term
• Example: m*nchen
• We could look up m* and *nchen in the B-tree and intersect the two term sets.
  – Expensive
• Alternative: permuterm index
  – Basic idea: Rotate every wildcard query, so that the * occurs at the end.
  – Store each of these rotations in the dictionary, say, in a B-tree

B-trees handle *’s at the end of a query term
• How can we handle *’s in the middle of a query term?
  – co*tion
• We could look up co* AND *tion in a B-tree and intersect the two term sets
  – Expensive
• The solution: transform wild-card queries so that the *’s occur at the end
• This gives rise to the Permuterm Index.

Sec. 3.2

Permuterm index
• For term HELLO: add hello$, ello$h, llo$he, lo$hel, and o$hell to the B-tree, where $ is a special symbol

Permuterm → term mapping
[Figure: each rotation stored in the permuterm index points back to the original term]

Permuterm index
• For HELLO, we’ve stored: hello$, ello$h, llo$he, lo$hel, and o$hell
• Queries
  – For X, look up X$
  – For X*, look up $X*
  – For *X, look up X$*
  – For *X*, look up X*
  – For X*Y, look up Y$X*
  – Example: For hel*o, look up o$hel*
• Permuterm index would better be called a permuterm tree. But permuterm index is the more common name.
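A minimal sketch of a permuterm index in Python, using a sorted list of rotations plus a prefix lookup in place of the B-tree (names are illustrative). Note that the full rotation set also includes $hello, which is what the $X* lookup for prefix queries relies on:

```python
import bisect

def rotations(term):
    # hello -> hello$, ello$h, llo$he, lo$hel, o$hell, $hello
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

vocab = ["hello", "help", "hall", "hollow"]
permuterm = sorted((rot, term) for term in vocab for rot in rotations(term))
keys = [rot for rot, _ in permuterm]

def wildcard_lookup(rotated_prefix):
    # B-tree range lookup, simulated with bisect over the sorted rotations
    lo = bisect.bisect_left(keys, rotated_prefix)
    hi = bisect.bisect_left(keys, rotated_prefix + "\uffff")
    return sorted({term for _, term in permuterm[lo:hi]})

# X*Y: rotate the query so the * is at the end, i.e. look up Y$X*
print(wildcard_lookup("o$hel"))   # hel*o -> o$hel*  ->  ['hello']
# X*: look up $X*, which needs the $-initial rotations to be stored
print(wildcard_lookup("$hel"))    # hel*  -> $hel*   ->  ['hello', 'help']
```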

Processing a lookup in the permuterm index
• Rotate the query wildcard to the right
• Use B-tree lookup as before
• Problem: Permuterm more than quadruples the size of the dictionary compared to a regular B-tree. (empirical number)

k-gram indexes
• More space-efficient than the permuterm index
• Enumerate all character k-grams (sequences of k characters) occurring in a term
• 2-grams are called bigrams.
• Example: from “April is the cruelest month” we get the bigrams:
  $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt th h$
• $ is a special word boundary symbol, as before.
• Maintain an inverted index from bigrams to the terms that contain the bigram
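A sketch of building such a k-gram (here bigram) index over the term vocabulary in Python (names and vocabulary are illustrative):

```python
from collections import defaultdict

def kgrams(term, k=2):
    padded = "$" + term + "$"              # $ marks the word boundary
    return {padded[i:i+k] for i in range(len(padded) - k + 1)}

vocab = ["april", "boardroom", "money", "month", "moon"]

kgram_index = defaultdict(set)             # bigram -> set of terms containing it
for term in vocab:
    for gram in kgrams(term):
        kgram_index[gram].add(term)

print(sorted(kgram_index["on"]))           # ['money', 'month', 'moon']
```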

Postings list in a 3-gram inverted index
[Figure: a 3-gram and its postings list of vocabulary terms containing that 3-gram]

k-gram (bigram, trigram, . . . ) indexes
• Note that we now have two different types of inverted indexes
• The term-document inverted index for finding documents based on a query consisting of terms
• The k-gram index for finding terms based on a query consisting of k-grams

Processing wildcarded terms in a bigram index
• Query mon* can now be run as: $m AND mo AND on
• Gets us all terms with the prefix mon . . .
• . . . but also many “false positives” like MOON.
• We must postfilter these terms against the query.
• Surviving terms are then looked up in the term-document inverted index.
• k-gram index vs. permuterm index
  – The k-gram index is more space efficient.
  – The permuterm index doesn’t require postfiltering.
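Continuing the bigram-index sketch above (kgram_index as built there), running mon* and postfiltering:

```python
# Query mon* -> bigrams $m AND mo AND on (no trailing $ because of the *)
candidates = kgram_index["$m"] & kgram_index["mo"] & kgram_index["on"]
print(sorted(candidates))                  # ['money', 'month', 'moon']

# Postfilter against the original wildcard to drop false positives like moon
matches = sorted(t for t in candidates if t.startswith("mon"))
print(matches)                             # ['money', 'month']
# The surviving terms would then be looked up in the term-document index.
```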

Processing wild-card queries
• As before, we must execute a Boolean query for each enumerated, filtered term.
• Wild-cards can result in expensive query execution (very large disjunctions…)
  – pyth* AND prog*
• If you encourage “laziness” people will respond!
• Which web search engines allow wildcard queries?

[Mock search box] Search: Type your search terms, use ‘*’ if you need to. E.g., Alex* will match Alexander.

Sec. 3.2.2

Outline
❶ Recap
❷ Dictionaries
❸ Wildcard queries
❹ Edit distance
❺ Spelling correction
❻ Soundex

Distance between misspelled word and “correct” word
• We will study several alternatives:
  – Weighted edit distance
  – Edit distance and Levenshtein distance
  – k-gram overlap

Weighted edit distance
• As with edit distance (next slide), but the weight of an operation depends on the characters involved.
• Meant to capture keyboard errors, e.g., m is more likely to be mistyped as n than as q.
• Therefore, replacing m by n is a smaller edit distance than replacing m by q.
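A minimal sketch of such a character-dependent substitution cost; the tiny keyboard-neighbour table is purely illustrative (a real system would use a full confusion/cost matrix learned from typing or OCR errors):

```python
# Illustrative keyboard-neighbour pairs only
ADJACENT_KEYS = {("m", "n"), ("n", "m"), ("a", "s"), ("s", "a")}

def substitution_cost(a, b):
    if a == b:
        return 0.0
    return 0.5 if (a, b) in ADJACENT_KEYS else 1.0   # adjacent keys are cheaper

print(substitution_cost("m", "n"))   # 0.5 -> replacing m by n costs less ...
print(substitution_cost("m", "q"))   # 1.0 ... than replacing m by q
```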

Edit distance
• The edit distance between string s1 and string s2 is the minimum number of basic operations that convert s1 to s2.
• Levenshtein distance: the admissible basic operations are insert, delete, and replace
• Levenshtein distance dog-do: 1
• Levenshtein distance cat-cart: 1
• Levenshtein distance cat-cut: 1
• Levenshtein distance cat-act: 2

Levenshtein distance: Algorithm
[Figure, built up over several slides: pseudocode for filling the dynamic-programming matrix of Levenshtein distances; a Python sketch is given after the next slide.]

Each cell of Levenshtein matrix
• cost of getting here from my upper left neighbor (copy or replace)
• cost of getting here from my upper neighbor (delete)
• cost of getting here from my left neighbor (insert)
• the minimum of the three possible “movements”; the cheapest way of getting here
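A sketch of the dynamic-programming computation in Python with unit costs; each cell takes the minimum of the three moves listed above. Swapping the unit substitution cost for a character-dependent one (as in the weighted sketch earlier) gives weighted edit distance:

```python
def levenshtein(s1, s2):
    m, n = len(s1), len(s2)
    # dist[i][j] = edit distance between s1[:i] and s2[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # delete all of s1[:i]
    for j in range(n + 1):
        dist[0][j] = j                      # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),  # copy / replace
                dist[i - 1][j] + 1,                             # delete
                dist[i][j - 1] + 1,                             # insert
            )
    return dist[m][n]

assert levenshtein("cat", "cart") == 1
assert levenshtein("cat", "act") == 2
print(levenshtein("oslo", "snow"))   # 3
```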

Levenshtein distance: Example
[Figure: the Levenshtein matrix filled in for an example word pair]

Exercise
❶ Compute the Levenshtein distance matrix for OSLO – SNOW
❷ What are the Levenshtein editing operations that transform cat into catcat?


How do I read out the editing operations that transform OSLO into SNOW?
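A sketch of reading the operations back out: walk from the bottom-right cell to the top-left, always stepping to the neighbour the minimum came from. It assumes the levenshtein sketch above is changed to return the full dist matrix rather than only dist[m][n]; the operation labels are illustrative:

```python
def trace_edits(s1, s2, dist):
    # dist: the filled Levenshtein matrix for s1 and s2 (see the sketch above)
    i, j, ops = len(s1), len(s2), []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           dist[i][j] == dist[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]):
            op = "copy" if s1[i - 1] == s2[j - 1] else "replace"
            ops.append((op, s1[i - 1], s2[j - 1]))      # diagonal move
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", s1[i - 1], ""))       # move from above
            i -= 1
        else:
            ops.append(("insert", "", s2[j - 1]))       # move from the left
            j -= 1
    return list(reversed(ops))

# One optimal sequence for oslo -> snow (cost 3):
# delete o, copy s, replace l by n, copy o, insert w
```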


Outline
❶ Recap
❷ Dictionaries
❸ Wildcard queries
❹ Edit distance
❺ Spelling correction
❻ Soundex

Spell correction
• Two main flavors:
  – Isolated word
    • Check each word on its own for misspelling
    • Will not catch typos resulting in correctly spelled words
    • e.g., from → form
  – Context-sensitive
    • Look at surrounding words
    • e.g., I flew form Heathrow to Narita.

Sec. 3.3

Correcting queries
• First: isolated word spelling correction
• Premise 1: There is a list of “correct words” from which the correct spellings come.
• Premise 2: We have a way of computing the distance between a misspelled word and a correct word.
• Simple spelling correction algorithm: return the “correct” word that has the smallest distance to the misspelled word.
• Example: informaton → information
• For the list of correct words, we can use the vocabulary of all words that occur in our collection.
• Why is this problematic?
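A sketch of the simple algorithm: return the vocabulary word with the smallest edit distance, reusing the levenshtein function from the sketch above (the small vocabulary is illustrative):

```python
def correct(word, vocabulary):
    # "Correct" word = vocabulary entry with the smallest edit distance
    return min(vocabulary, key=lambda v: levenshtein(word, v))

vocab = ["information", "informal", "retrieval", "formation"]
print(correct("informaton", vocab))   # information (distance 1)
```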

Alternatives to using the term vocabulary
• A standard dictionary (Webster’s, OED, etc.)
• An industry-specific dictionary (for specialized IR systems)

k-gram indexes for spelling correction
• Enumerate all k-grams in the query term
• Example: bigram index, misspelled word bordroom
  – Bigrams: bo, or, rd, dr, ro, oo, om
• Use the k-gram index to retrieve “correct” words that match query term k-grams
• Threshold by number of matching k-grams
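A sketch of the candidate-retrieval step in Python (vocabulary and threshold value are illustrative; the bigrams here include the $-padded boundary grams from the earlier sketch):

```python
from collections import Counter, defaultdict

def kgrams(term, k=2):
    padded = "$" + term + "$"                      # $ marks the word boundary
    return {padded[i:i+k] for i in range(len(padded) - k + 1)}

def kgram_candidates(query, vocabulary, threshold=2):
    index = defaultdict(set)                       # bigram -> terms containing it
    for term in vocabulary:
        for gram in kgrams(term):
            index[gram].add(term)
    counts = Counter()                             # term -> number of matching bigrams
    for gram in kgrams(query):
        counts.update(index.get(gram, ()))
    return sorted(t for t, c in counts.items() if c >= threshold)

print(kgram_candidates("bordroom", ["boardroom", "border", "borscht", "moon"]))
# ['boardroom', 'border', 'borscht'] -- moon falls below the threshold
```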

k-gram indexes for spelling correction: bordroom
[Figure: postings lists for the bigrams of bordroom and the vocabulary terms they point to]

Jaccard coefficient
• A commonly-used measure of overlap
• Let X and Y be two sets; then the Jaccard coefficient is
  |X ∩ Y| / |X ∪ Y|
• Equals 1 when X and Y have the same elements and zero when they are disjoint
• X and Y don’t have to be of the same size

Sec. 3.3.4

Example: bigram match
• Query A = bordroom reaches term B = boardroom
• A = bo or rd dr ro oo om
• B = bo oa ar rd dr ro oo om
• |A ∩ B| / |A ∪ B| = 6 / (7 + 8 − 6) = 6/9
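A sketch of the computation for the example above, using bigram sets without $-padding as on the slide (helper names are illustrative):

```python
def jaccard(x, y):
    return len(x & y) / len(x | y)

def bigrams(term):
    return {term[i:i+2] for i in range(len(term) - 1)}

a = bigrams("bordroom")    # {bo, or, rd, dr, ro, oo, om}        -> 7 bigrams
b = bigrams("boardroom")   # {bo, oa, ar, rd, dr, ro, oo, om}    -> 8 bigrams
print(jaccard(a, b))       # 6 / (7 + 8 - 6) = 6/9 ≈ 0.667
```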

Context-sensitive spell correction
• Text: I flew from Heathrow to Narita.
• Consider the phrase query “flew form Heathrow”
• We’d like to respond
  Did you mean “flew from Heathrow”?
  because no docs matched the query phrase.

Sec. 3.3.5

Context-sensitive correction
• Need surrounding context to catch this.
• First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
• Now try all possible resulting phrases with one word “fixed” at a time
  – flew from heathrow
  – fled form heathrow
  – flea form heathrow
• Hit-based spelling correction: Suggest the alternative that has lots of hits.

Sec. 3.3.5
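A sketch of the enumerate-and-count heuristic. Both candidates and hit_count are hypothetical stand-ins here: the first for the close-terms lookup (e.g. by weighted edit distance), the second for however many results the engine returns for a phrase:

```python
def candidates(word):
    # Hypothetical: close dictionary terms per query word
    return {"flew": ["flew", "fled", "flea"],
            "form": ["form", "from"]}.get(word, [word])

def hit_count(phrase):
    # Hypothetical: number of documents matching the phrase
    return {"flew from heathrow": 52}.get(phrase, 0)

def correct_phrase(query):
    words = query.split()
    best, best_hits = query, hit_count(query)
    for i, w in enumerate(words):                  # fix one word at a time
        for alt in candidates(w):
            variant = " ".join(words[:i] + [alt] + words[i + 1:])
            if hit_count(variant) > best_hits:
                best, best_hits = variant, hit_count(variant)
    return best

print(correct_phrase("flew form heathrow"))   # 'flew from heathrow'
```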

Outline
❶ Recap
❷ Dictionaries
❸ Wildcard queries
❹ Edit distance
❺ Spelling correction
❻ Soundex

Soundex
• Soundex is the basis for finding phonetic (as opposed to orthographic) alternatives.
• Example: chebyshev / tchebyscheff
• Algorithm:
  – Turn every token to be indexed into a 4-character reduced form
  – Do the same with query terms
  – Build and search an index on the reduced forms

Soundex algorithm
❶ Retain the first letter of the term.
❷ Change all occurrences of the following letters to ’0’ (zero): A, E, I, O, U, H, W, Y
❸ Change letters to digits as follows:
  – B, F, P, V to 1
  – C, G, J, K, Q, S, X, Z to 2
  – D, T to 3
  – L to 4
  – M, N to 5
  – R to 6
❹ Repeatedly remove one out of each pair of consecutive identical digits
❺ Remove all zeros from the resulting string; pad the resulting string with trailing zeros and return the first four positions, which will consist of a letter followed by three digits
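A sketch of the five steps in Python (the dict.fromkeys calls are just a compact way to write the letter groups; characters outside the table default to '0'):

```python
SOUNDEX_CODES = {**dict.fromkeys("AEIOUHWY", "0"),
                 **dict.fromkeys("BFPV", "1"),
                 **dict.fromkeys("CGJKQSXZ", "2"),
                 **dict.fromkeys("DT", "3"),
                 "L": "4",
                 **dict.fromkeys("MN", "5"),
                 "R": "6"}

def soundex(term):
    term = term.upper()
    first = term[0]                                          # step 1
    digits = [SOUNDEX_CODES.get(c, "0") for c in term[1:]]   # steps 2-3
    deduped = []
    for d in digits:                                         # step 4: collapse runs
        if not deduped or d != deduped[-1]:
            deduped.append(d)
    code = "".join(d for d in deduped if d != "0")           # step 5: drop zeros
    return (first + code + "000")[:4]                        # pad and truncate

print(soundex("HERMAN"))    # H655
print(soundex("HERMANN"))   # H655 -- same code, as the next slide notes
```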

Example: Soundex of HERMAN
• Retain H
• ERMAN → 0RM0N
• 0RM0N → 06505
• 06505 → 06505
• 06505 → 655
• Return H655
• Note: HERMANN will generate the same code

How useful is Soundex?
• Not very – for information retrieval
• OK for similar-sounding terms.
• The idea owes its origins to work in international police departments, seeking to match names for wanted criminals despite the names being spelled differently in different countries.