Information Retrieval with Open Source


Upload: korzonek

Post on 20-Jun-2015


TRANSCRIPT

Page 1: Information Retrieval with Open Source
Page 3: Information Retrieval with Open Source

Information Retrieval

Page 4: Information Retrieval with Open Source
Page 5: Information Retrieval with Open Source
Page 6: Information Retrieval with Open Source
Page 7: Information Retrieval with Open Source

Retrieval strategies

• Vector Space Model

• Latent Semantic Indexing

• Probabilistic Retrieval Strategies

• Language Models

• Inference Networks

• Extended Boolean Retrieval

• Neural Networks

• Genetic Algorithms

• Fuzzy Set Retrieval

Page 8: Information Retrieval with Open Source

Vector space model

Page 9: Information Retrieval with Open Source

Text retrieval

Page 10: Information Retrieval with Open Source

Analysis

Page 11: Information Retrieval with Open Source

Tokenization

Page 12: Information Retrieval with Open Source

Stop-words

Page 13: Information Retrieval with Open Source

Stemming

Lemmatization

Page 14: Information Retrieval with Open Source

http://tartarus.org/~martin/PorterStemmer/

Page 15: Information Retrieval with Open Source

Document

Term

Page 16: Information Retrieval with Open Source
Page 17: Information Retrieval with Open Source

Term frequency

Page 18: Information Retrieval with Open Source

Inverse document frequency

Preliminary draft (c) 2007 Cambridge UP

6.2 Term frequency and weighting

Word       cf     df
ferrari    10422  17
insurance  10440  3997

Figure 6.3 Collection frequency (cf) and document frequency (df) behave differently.

term       df_t       idf_t
calpurnia  1          6
animal     100        4
sunday     1000       3
fly        10,000     2
under      100,000    1
the        1,000,000  0

Figure 6.4 Example of idf values. Here we give the idf's of terms with various frequencies in a corpus of 1,000,000 documents.

a term t. The reason to prefer df to cf is illustrated in Figure 6.3, where a simple example shows that collection frequency (cf) and document frequency (df) can behave rather differently. In particular, the cf values for both ferrari and insurance are roughly equal, but their df values differ significantly. This suggests that the few documents that do contain ferrari mention this term frequently, so that its cf is high but the df is not. Intuitively, we want such terms to be treated differently: the few documents that contain ferrari should get a significantly higher boost for a query on ferrari than the many documents containing insurance get from a query on insurance.

How is the document frequency df of a term used to scale its weight? Denoting as usual the total number of documents in a corpus by N, we define the inverse document frequency (idf) of a term t as follows:

idf_t = log(N / df_t)    (6.1)

Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low. Figure 6.4 gives an example of idf's in a corpus of 1,000,000 documents; in this example logarithms are to the base 10.
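As a quick check on Figure 6.4, the idf values can be recomputed directly from equation (6.1). This is an illustrative Python sketch; the corpus size and df values are taken from the figure:

```python
import math

# Corpus statistics from Figure 6.4: N = 1,000,000 documents.
N = 1_000_000
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

# idf_t = log10(N / df_t), equation (6.1) with base-10 logarithms.
idf = {term: math.log10(N / d) for term, d in df.items()}

for term in df:
    print(f"{term:10s} df={df[term]:>9} idf={idf[term]:.0f}")
```

The printed column reproduces the idf values 6, 4, 3, 2, 1, 0 from the figure.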

Exercise 6.2

Why is the idf of a term always finite?

Exercise 6.3

What is the idf of a term that occurs in every document? Compare this with the use of stop word lists.

Page 19: Information Retrieval with Open Source

6 Scoring and term weighting

6.2.2 Tf-idf weighting

We now combine the above expressions for term frequency and inverse document frequency, to produce a composite weight for each term in each document. The tf-idf weighting scheme assigns to term t a weight in document d given by

tf-idf_{t,d} = tf_{t,d} × idf_t    (6.2)

In other words, tf-idft,d assigns to term t a weight in document d that is

1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);

2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);

3. lowest when the term occurs in virtually all documents.

At this point, we may view each document as a vector with one component corresponding to each term, together with a weight for each component that is given by (6.2). This vector form will prove to be crucial to scoring and ranking; we will develop these ideas in Chapter 7. As a first step, we introduce the overlap score measure: the score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d. We can refine this idea so that we add up not the number of occurrences of each query term t in d, but instead the tf-idf weight of each term in d.

Score(q, d) = Σ_{t∈q} tf-idf_{t,d}    (6.3)
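Equations (6.2) and (6.3) can be sketched in a few lines of Python; the tiny two-document collection and the query below are made up purely for illustration:

```python
import math

# Hypothetical toy collection.
docs = {
    "d1": "gold gold truck".split(),
    "d2": "silver truck".split(),
}
query = "gold truck".split()
N = len(docs)

def df(t):
    # Number of documents containing term t.
    return sum(1 for terms in docs.values() if t in terms)

def tf_idf(t, d):
    # tf-idf_{t,d} = tf_{t,d} * idf_t   (6.2), base-10 logarithm.
    return docs[d].count(t) * math.log10(N / df(t))

def score(q, d):
    # Score(q, d) = sum over query terms t of tf-idf_{t,d}   (6.3)
    return sum(tf_idf(t, d) for t in q if df(t) > 0)

for d in docs:
    print(d, round(score(query, d), 3))
```

Note that "truck" occurs in every document, so its idf (and hence its contribution to every score) is zero, which is exactly the stop-word behavior asked about in Exercise 6.3.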

Exercise 6.4

Can the tf-idf weight of a term in a document exceed 1?

Exercise 6.5

How does the base of the logarithm in (6.1) affect the score calculation in (6.3)? How does the base of the logarithm affect the relative scores of two documents on a given query?

Exercise 6.6

If the logarithm in (6.1) is computed base 2, suggest a simple approximation to the idf of a term.

6.3 Variants in weighting functions

A number of alternative schemes to tf and tf-idf have been considered; we discuss some of the principal ones here.

Page 20: Information Retrieval with Open Source

Search

Page 21: Information Retrieval with Open Source

7 Vector space retrieval

Figure 7.1 Cosine similarity illustrated.

Thus the two documents “Mary is quicker than John” and “John is quicker than Mary” are identical in such a bag of words representation.

How do we quantify the similarity between two documents in this vector space? A first attempt might consider the magnitude of the vector difference between two document vectors. This measure suffers from a drawback: two documents with very similar term distributions can have a significant vector difference simply because one is much longer than the other. Thus the relative distributions of terms may be identical in the two documents, but the absolute term frequencies of one may be far larger.

To compensate for the effect of document length, the standard way of quantifying the similarity between two documents d1 and d2 is to compute the cosine similarity of their vector representations V(d1) and V(d2):

sim(d1, d2) = (V(d1) · V(d2)) / (|V(d1)| |V(d2)|)    (7.1)

where the numerator represents the inner product (also known as the dot product) of the vectors V(d1) and V(d2), while the denominator is the product of their lengths. The effect of the denominator is to normalize the vectors V(d1) and V(d2) to unit vectors v(d1) = V(d1)/|V(d1)| and v(d2) = V(d2)/|V(d2)|. We can then rewrite (7.1) as

sim(d1, d2) = v(d1) · v(d2)    (7.2)

Thus, (7.2) can be viewed as the inner product of the normalized versions of the two document vectors. What use is the similarity measure sim(d1, d2)? Given a document d (potentially one of the di in the collection), consider…
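The length-normalization argument can be seen in a small sketch: two weight vectors with an identical term distribution but different lengths have a large vector difference, yet cosine similarity 1. The vectors below are hypothetical:

```python
import math

def cosine(v1, v2):
    # (7.1): dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) *
                  math.sqrt(sum(b * b for b in v2)))

d1 = [1.0, 2.0, 0.0, 3.0]
d2 = [2.0, 4.0, 0.0, 6.0]  # same distribution, document twice as "long"

diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))
print(f"vector difference = {diff:.3f}")            # large
print(f"cosine similarity = {cosine(d1, d2):.3f}")  # 1.000
```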

Page 22: Information Retrieval with Open Source
Page 23: Information Retrieval with Open Source

Q: “gold silver truck”

D1: “Shipment of gold damaged in a fire”

D2: “Delivery of silver arrived in a silver truck”

D3: “Shipment of gold arrived in a truck”

Page 24: Information Retrieval with Open Source

a arrived damaged delivery fire gold in of shipment silver truck

D1 1 0 1 0 1 1 1 1 1 0 0

D2 1 1 0 1 0 0 1 1 0 2 0

D3 1 1 0 0 0 1 1 1 1 0 1

Q 0 0 0 0 0 1 0 0 0 1 1

TF

Page 25: Information Retrieval with Open Source

• a log 3/3 = 0

• arrived log 3/2 = 0.176

• damaged log 3/1 = 0.477

• delivery log 3/1 = 0.477

• fire log 3/1 = 0.477

• in log 3/3 = 0

• of log 3/3 = 0

• silver log 3/1 = 0.477

• shipment log 3/2 = 0.176

• truck log 3/2 = 0.176

• gold log 3/2 = 0.176
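The idf values above can be recomputed from the three documents; a short sketch (base-10 logarithm, N = 3 documents):

```python
import math

docs = [
    "shipment of gold damaged in a fire",            # D1
    "delivery of silver arrived in a silver truck",  # D2
    "shipment of gold arrived in a truck",           # D3
]
N = len(docs)

terms = sorted({t for d in docs for t in d.split()})
idf = {}
for t in terms:
    df = sum(1 for d in docs if t in d.split())  # document frequency
    idf[t] = round(math.log10(N / df), 3)
    print(f"{t:9s} log {N}/{df} = {idf[t]}")
```

The output matches the slide: terms appearing in all three documents (a, in, of) get idf 0, terms in two documents get 0.176, and terms in a single document get 0.477.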


Page 26: Information Retrieval with Open Source

a arrived damaged delivery fire gold in of shipment silver truck

D1 0 0 0.477 0 0.477 0.176 0 0 0.176 0 0

D2 0 0.176 0 0.477 0 0 0 0 0 0.954 0.176

D3 0 0.176 0 0 0 0.176 0 0 0.176 0 0.176

Q 0 0 0 0 0 0.176 0 0 0 0.477 0.176

Page 27: Information Retrieval with Open Source

SC(Q,D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) + (0)(0.176) + (0.477)(0) + (0.176)(0) = (0.176)(0.176) ≈ 0.031

Page 28: Information Retrieval with Open Source

SC(Q,D2) = (0.954)(0.477) + (0.176)(0.176) ≈ 0.486

SC(Q,D3) = (0.176)(0.176) + (0.176)(0.176) ≈ 0.062
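Putting the whole worked example together, the three scores can be reproduced with a short sketch (inner product of tf × idf vectors, base-10 logarithm, idf computed over the three documents):

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"
N = len(docs)
vocab = sorted({t for d in docs.values() for t in d.split()})

def idf(t):
    df = sum(1 for d in docs.values() if t in d.split())
    return math.log10(N / df)

def weights(text):
    # tf * idf weight for every vocabulary term.
    toks = text.split()
    return [toks.count(t) * idf(t) for t in vocab]

q = weights(query)
scores = {}
for name, text in docs.items():
    d = weights(text)
    scores[name] = sum(a * b for a, b in zip(q, d))
    print(name, round(scores[name], 3))
```

D2 ranks highest (≈0.486), as on the slide: it is the only document containing silver, and contains it twice.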

Page 29: Information Retrieval with Open Source

Inverted index

Page 30: Information Retrieval with Open Source

Each term points to a postings list of (document, frequency) pairs:

term-1 → (dn,1) (d10,1)
term-2 → (dn,5) (dn,3)
term-3 → (d2,11) (d10,1)
term-4 → (dn,1) (d2,1)
term-5 → (dn,2) (d4,3)
term-n → (d6,1) (d7,3)
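A minimal in-memory version of this structure: each term maps to a postings list of (document id, term frequency) pairs. This is an illustrative sketch, not Lucene's on-disk index format:

```python
from collections import defaultdict

def build_index(docs):
    # Map each term to a postings list of (doc_id, term_frequency) pairs.
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for token in text.lower().split():
            counts[token] += 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))  # one posting per document
    return dict(index)

docs = {
    "d1": "shipment of gold damaged in a fire",
    "d2": "delivery of silver arrived in a silver truck",
}
index = build_index(docs)
print(index["silver"])  # [('d2', 2)]
print(index["gold"])    # [('d1', 1)]
```

At query time, only the postings lists of the query terms need to be read, which is what makes the inverted index fast: documents containing none of the query terms are never touched.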

Page 31: Information Retrieval with Open Source

Lucene

Page 32: Information Retrieval with Open Source

Analysis

Page 33: Information Retrieval with Open Source
Page 34: Information Retrieval with Open Source
Page 35: Information Retrieval with Open Source


when you order the filtering process. Consider an analyzer that removes stop words and also injects synonyms into the token stream: it would be more efficient to remove the stop words first so that the synonym injection filter would have fewer terms to consider (see section 4.6 for a detailed example).

4.3 Using the built-in analyzers

Lucene includes several built-in analyzers. The primary ones are shown in table 4.2. We'll leave discussion of the two language-specific analyzers, RussianAnalyzer and GermanAnalyzer, to section 4.8.2, and the special per-field analyzer wrapper, PerFieldAnalyzerWrapper, to section 4.4.

The built-in analyzers we discuss in this section—WhitespaceAnalyzer, SimpleAnalyzer, StopAnalyzer, and StandardAnalyzer—are designed to work with text in almost any Western (European-based) language. You can see the effect of each of these analyzers in the output in section 4.2.3. WhitespaceAnalyzer and SimpleAnalyzer are both trivial and we don't cover them in more detail here. We explore the StopAnalyzer and StandardAnalyzer in more depth because they have non-trivial effects.

4.3.1 StopAnalyzer

StopAnalyzer, beyond doing basic word splitting and lowercasing, also removes stop words. Embedded in StopAnalyzer is a list of common English stop words; this list is used unless otherwise specified:

public static final String[] ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",

Table 4.2 Primary analyzers available in Lucene

Analyzer Steps taken

WhitespaceAnalyzer Splits tokens at whitespace

SimpleAnalyzer Divides text at nonletter characters and lowercases

StopAnalyzer Divides text at nonletter characters, lowercases, and removes stop words

StandardAnalyzer Tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases; and removes stop words


Page 36: Information Retrieval with Open Source

Index

Page 37: Information Retrieval with Open Source

Index

• IndexWriter

• Directory

• Analyzer

• Document

• Field

Page 38: Information Retrieval with Open Source

store

Index options: store

Value        Description
:no          Don't store the field.
:yes         Store the field in its original format. Use this value if you want to highlight matches or print match excerpts à la Google search.
:compressed  Store the field in compressed format.

Ruby Day Kraków: Full Text Search with Ferret

Page 39: Information Retrieval with Open Source

index

Index options: index

Value         Description
:no           Do not make this field searchable.
:yes          Make this field searchable and tokenize its contents.
:untokenized  Make this field searchable but do not tokenize its contents. Use this value for fields you wish to sort by.
:omit_norms   Same as :yes except omit the norms file. The norms file can be omitted if you don't boost any fields and you don't need scoring based on field length.
:untokenized_omit_norms  Same as :untokenized except omit the norms file.


Page 40: Information Retrieval with Open Source

term_vector

Index options: term_vector

Value                    Description
:no                      Don't store term vectors.
:yes                     Store term vectors without storing positions or offsets.
:with_positions          Store term vectors with positions.
:with_offsets            Store term vectors with offsets.
:with_positions_offsets  Store term vectors with positions and offsets.


Page 41: Information Retrieval with Open Source
Page 42: Information Retrieval with Open Source

Search

Page 43: Information Retrieval with Open Source

Search

• IndexSearcher

• Term

• Query

• Hits

Page 44: Information Retrieval with Open Source

Query

Page 45: Information Retrieval with Open Source

Query

• API

• new TermQuery(new Term("name", "Tomek"));

• Lucene QueryParser

• queryParser.parse("name:Tomek");

Page 46: Information Retrieval with Open Source

TermQuery

name:Tomek

Page 47: Information Retrieval with Open Source

BooleanQuery

rambo OR ninja

+rambo +ninja -name:rocky

Page 48: Information Retrieval with Open Source

PhraseQuery

"ninja java" -name:rocky

Page 49: Information Retrieval with Open Source

SloppyPhraseQuery

"red-faced politicians"~3

Page 50: Information Retrieval with Open Source

RangeQuery

releaseDate:[2000 TO 2007]

Page 51: Information Retrieval with Open Source

WildcardQuery

sup?r, su*r, super*

Page 52: Information Retrieval with Open Source

FuzzyQuery

color~

colour, collor, colro

Page 53: Information Retrieval with Open Source

http://en.wikipedia.org/wiki/Levenshtein_distance

color → colour: distance 1

colour → coller: distance 2

Page 54: Information Retrieval with Open Source

Equation 1. Levenshtein Distance Score

This means that an exact match will have a score of 1.0, whereas terms with no corresponding letters will have a score of 0.0. Since FuzzyQuery has a limit to the number of matching terms it can use, the lowest scoring matches get discarded if the FuzzyQuery becomes full.
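The edit distance behind this score is easy to sketch. The similarity normalization shown below (1 minus distance divided by the shorter term's length) is an assumption modeled on how older Lucene/Ferret FuzzyQuery implementations derive a score from the edit distance, so treat it as illustrative:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_score(a, b):
    # Assumed normalization: 1 - distance / shorter length.
    return 1.0 - levenshtein(a, b) / min(len(a), len(b))

print(levenshtein("color", "colour"))   # 1
print(levenshtein("colour", "coller"))  # 2
print(fuzzy_score("color", "color"))    # 1.0
```

The two distance calls reproduce the examples on the previous slide: one insertion turns "color" into "colour", and two substitutions turn "colour" into "coller".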

Due to the way FuzzyQuery is implemented, it needs to enumerate every single term in its field's index to find all valid similar terms in the dictionary. This can take a long time if you have a large index. One way to prevent any performance problems is to set a minimum prefix length. This is done by setting the :min_prefix_length parameter when creating the FuzzyQuery. This parameter is set to 0 by default, hence the fact that it would need to enumerate every term in the index.

To minimize the expense of finding matching terms, we could set the minimum prefix length of the example query to 3. This would greatly reduce the number of terms that need to be enumerated, and "color" would still match "colour," although "cloor" would no longer match.

# FQL: 'content:color~' => no way to set :min_prefix_length in FQL
query = FuzzyQuery.new(:content, "color",
                       :max_terms => 1024,
                       :min_prefix_length => 3)

You can also set a cut-off score for matching terms by setting the :min_similarity parameter. This will not affect how many terms are enumerated, but it will affect how many terms are added to the internal MultiTermQuery, which can also help improve performance.

# FQL: 'content:color~0.8' => no way to set :min_prefix_length in FQL
query = FuzzyQuery.new(:content, "color",
                       :max_terms => 1024,
                       :min_similarity => 0.8,
                       :min_prefix_length => 3)

In some cases, you may want to change the default values for :min_prefix_length and :min_similarity, particularly for use in the Ferret QueryParser. Simply set the class variables in FuzzyQuery.

FuzzyQuery.default_min_similarity = 0.8
FuzzyQuery.default_prefix_length = 3


Page 55: Information Retrieval with Open Source

Boost

title:Spring^10

Page 56: Information Retrieval with Open Source
Page 57: Information Retrieval with Open Source