Information Retrieval with Open Source


Upload: korzonek

Post on 20-Jun-2015


TRANSCRIPT

Page 1: Information Retrieval with Open Source
Page 3: Information Retrieval with Open Source

Information Retrieval

Page 4: Information Retrieval with Open Source
Page 5: Information Retrieval with Open Source
Page 6: Information Retrieval with Open Source
Page 7: Information Retrieval with Open Source

Retrieval strategies

• Vector Space Model

• Latent Semantic Indexing

• Probabilistic Retrieval Strategies

• Language Models

• Inference Networks

• Extended Boolean Retrieval

• Neural Networks

• Genetic Algorithms

• Fuzzy Set Retrieval

Page 8: Information Retrieval with Open Source

Vector space model

Page 9: Information Retrieval with Open Source

Text retrieval

Page 10: Information Retrieval with Open Source

Analysis

Page 11: Information Retrieval with Open Source

Tokenization

Page 12: Information Retrieval with Open Source

Stop-words

Page 13: Information Retrieval with Open Source

Stemming

Lemmatization

Page 14: Information Retrieval with Open Source

http://tartarus.org/~martin/PorterStemmer/

Page 15: Information Retrieval with Open Source

Document

Term

Page 16: Information Retrieval with Open Source
Page 17: Information Retrieval with Open Source

Term frequency

Page 18: Information Retrieval with Open Source

Inverse document frequency

Preliminary draft (c) 2007 Cambridge UP

6.2 Term frequency and weighting

Word       cf     df
ferrari    10422  17
insurance  10440  3997

Figure 6.3 Collection frequency (cf) and document frequency (df) behave differently.

term       df_t       idf_t
calpurnia  1          6
animal     100        4
sunday     1000       3
fly        10,000     2
under      100,000    1
the        1,000,000  0

Figure 6.4 Example of idf values. Here we give the idf's of terms with various frequencies in a corpus of 1,000,000 documents.

a term t. The reason to prefer df to cf is illustrated in Figure 6.3, where a simple example shows that collection frequency (cf) and document frequency (df) can behave rather differently. In particular, the cf values for both ferrari and insurance are roughly equal, but their df values differ significantly. This suggests that the few documents that do contain ferrari mention this term frequently, so that its cf is high but the df is not. Intuitively, we want such terms to be treated differently: the few documents that contain ferrari should get a significantly higher boost for a query on ferrari than the many documents containing insurance get from a query on insurance.

How is the document frequency df of a term used to scale its weight? Denoting as usual the total number of documents in a corpus by N, we define the inverse document frequency (idf) of a term t as follows:

idf_t = log(N / df_t)    (6.1)

Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low. Figure 6.4 gives an example of idf's in a corpus of 1,000,000 documents; in this example logarithms are to the base 10.
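As a quick check on Figure 6.4, the idf values can be recomputed directly from equation (6.1). This is an illustrative Python sketch; the corpus size and df values are taken from the figure:

```python
import math

# Corpus statistics from Figure 6.4: N = 1,000,000 documents.
N = 1_000_000
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

# idf_t = log10(N / df_t), equation (6.1) with base-10 logarithms.
idf = {term: math.log10(N / d) for term, d in df.items()}

for term in df:
    print(f"{term:10s} df={df[term]:>9} idf={idf[term]:.0f}")
```

The printed column reproduces the idf values 6, 4, 3, 2, 1, 0 from the figure.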

Exercise 6.2

Why is the idf of a term always finite?

Exercise 6.3

What is the idf of a term that occurs in every document? Compare this with the use of stop word lists.

Page 19: Information Retrieval with Open Source

6 Scoring and term weighting

6.2.2 Tf-idf weighting

We now combine the above expressions for term frequency and inverse document frequency, to produce a composite weight for each term in each document. The tf-idf weighting scheme assigns to term t a weight in document d given by

tf-idf_{t,d} = tf_{t,d} × idf_t    (6.2)

In other words, tf-idft,d assigns to term t a weight in document d that is

1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);

2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);

3. lowest when the term occurs in virtually all documents.

At this point, we may view each document as a vector with one component corresponding to each term, together with a weight for each component that is given by (6.2). This vector form will prove to be crucial to scoring and ranking; we will develop these ideas in Chapter 7. As a first step, we introduce the overlap score measure: the score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d. We can refine this idea so that we add up not the number of occurrences of each query term t in d, but instead the tf-idf weight of each term in d.

Score(q, d) = Σ_{t∈q} tf-idf_{t,d}    (6.3)
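Equations (6.2) and (6.3) can be sketched in a few lines of Python; the tiny two-document collection and the query below are made up purely for illustration:

```python
import math

# Hypothetical toy collection.
docs = {
    "d1": "gold gold truck".split(),
    "d2": "silver truck".split(),
}
query = "gold truck".split()
N = len(docs)

def df(t):
    # Number of documents containing term t.
    return sum(1 for terms in docs.values() if t in terms)

def tf_idf(t, d):
    # tf-idf_{t,d} = tf_{t,d} * idf_t   (6.2), base-10 logarithm.
    return docs[d].count(t) * math.log10(N / df(t))

def score(q, d):
    # Score(q, d) = sum over query terms t of tf-idf_{t,d}   (6.3)
    return sum(tf_idf(t, d) for t in q if df(t) > 0)

for d in docs:
    print(d, round(score(query, d), 3))
```

Note that "truck" occurs in every document, so its idf (and hence its contribution to every score) is zero, which is exactly the stop-word behavior asked about in Exercise 6.3.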

Exercise 6.4

Can the tf-idf weight of a term in a document exceed 1?

Exercise 6.5

How does the base of the logarithm in (6.1) affect the score calculation in (6.3)? How does the base of the logarithm affect the relative scores of two documents on a given query?

Exercise 6.6

If the logarithm in (6.1) is computed base 2, suggest a simple approximation to the idf of a term.

6.3 Variants in weighting functions

A number of alternative schemes to tf and tf-idf have been considered; we discuss some of the principal ones here.

Page 20: Information Retrieval with Open Source

Search

Page 21: Information Retrieval with Open Source

7 Vector space retrieval

Figure 7.1 Cosine similarity illustrated.

Thus the two documents “Mary is quicker than John” and “John is quicker than Mary” are identical in such a bag of words representation.

How do we quantify the similarity between two documents in this vector space? A first attempt might consider the magnitude of the vector difference between two document vectors. This measure suffers from a drawback: two documents with very similar term distributions can have a significant vector difference simply because one is much longer than the other. Thus the relative distributions of terms may be identical in the two documents, but the absolute term frequencies of one may be far larger.

To compensate for the effect of document length, the standard way of quantifying the similarity between two documents d1 and d2 is to compute the cosine similarity of their vector representations V(d1) and V(d2):

sim(d1, d2) = (V(d1) · V(d2)) / (|V(d1)| |V(d2)|)    (7.1)

where the numerator represents the inner product (also known as the dot product) of the vectors V(d1) and V(d2), while the denominator is the product of their lengths. The effect of the denominator is to normalize the vectors V(d1) and V(d2) to unit vectors v(d1) = V(d1)/|V(d1)| and v(d2) = V(d2)/|V(d2)|. We can then rewrite (7.1) as

sim(d1, d2) = v(d1) · v(d2)    (7.2)

Thus, (7.2) can be viewed as the inner product of the normalized versions of the two document vectors. What use is the similarity measure sim(d1, d2)? Given a document d (potentially one of the di in the collection), consider…
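The length-normalization argument can be seen in a small sketch: two weight vectors with an identical term distribution but different lengths have a large vector difference, yet cosine similarity 1. The vectors below are hypothetical:

```python
import math

def cosine(v1, v2):
    # (7.1): dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) *
                  math.sqrt(sum(b * b for b in v2)))

d1 = [1.0, 2.0, 0.0, 3.0]
d2 = [2.0, 4.0, 0.0, 6.0]  # same distribution, document twice as "long"

diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))
print(f"vector difference = {diff:.3f}")            # large
print(f"cosine similarity = {cosine(d1, d2):.3f}")  # 1.000
```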

Page 22: Information Retrieval with Open Source
Page 23: Information Retrieval with Open Source

Q: “gold silver truck”

D1: “Shipment of gold damaged in a fire”

D2: “Delivery of silver arrived in a silver truck”

D3: “Shipment of gold arrived in a truck”

Page 24: Information Retrieval with Open Source

a arrived damaged delivery fire gold in of shipment silver truck

D1 1 0 1 0 1 1 1 1 1 0 0

D2 1 1 0 1 0 0 1 1 0 2 0

D3 1 1 0 0 0 1 1 1 1 0 1

Q 0 0 0 0 0 1 0 0 0 1 1

TF

Page 25: Information Retrieval with Open Source

• a log 3/3 = 0

• arrived log 3/2 = 0.176

• damaged log 3/1 = 0.477

• delivery log 3/1 = 0.477

• fire log 3/1 = 0.477

• in log 3/3 = 0

• of log 3/3 = 0

• silver log 3/1 = 0.477

• shipment log 3/2 = 0.176

• truck log 3/2 = 0.176

• gold log 3/2 = 0.176
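The idf values above can be recomputed from the three documents; a short sketch (base-10 logarithm, N = 3 documents):

```python
import math

docs = [
    "shipment of gold damaged in a fire",            # D1
    "delivery of silver arrived in a silver truck",  # D2
    "shipment of gold arrived in a truck",           # D3
]
N = len(docs)

terms = sorted({t for d in docs for t in d.split()})
idf = {}
for t in terms:
    df = sum(1 for d in docs if t in d.split())  # document frequency
    idf[t] = round(math.log10(N / df), 3)
    print(f"{t:9s} log {N}/{df} = {idf[t]}")
```

The output matches the slide: terms appearing in all three documents (a, in, of) get idf 0, terms in two documents get 0.176, and terms in a single document get 0.477.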


Page 26: Information Retrieval with Open Source

a arrived damaged delivery fire gold in of shipment silver truck

D1 0 0 0.477 0 0.477 0.176 0 0 0.176 0 0

D2 0 0.176 0 0.477 0 0 0 0 0 0.954 0.176

D3 0 0.176 0 0 0 0.176 0 0 0.176 0 0.176

Q 0 0 0 0 0 0.176 0 0 0 0.477 0.176

Page 27: Information Retrieval with Open Source

SC(Q,D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) + (0)(0.176) + (0.477)(0) + (0.176)(0) = (0.176)(0.176) ≈ 0.031

Page 28: Information Retrieval with Open Source

SC(Q,D2) = (0.954)(0.477) + (0.176)(0.176) ≈ 0.486

SC(Q,D3) = (0.176)(0.176) + (0.176)(0.176) ≈ 0.062
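Putting the whole worked example together, the three scores can be reproduced with a short sketch (inner product of tf × idf vectors, base-10 logarithm, idf computed over the three documents):

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"
N = len(docs)
vocab = sorted({t for d in docs.values() for t in d.split()})

def idf(t):
    df = sum(1 for d in docs.values() if t in d.split())
    return math.log10(N / df)

def weights(text):
    # tf * idf weight for every vocabulary term.
    toks = text.split()
    return [toks.count(t) * idf(t) for t in vocab]

q = weights(query)
scores = {}
for name, text in docs.items():
    d = weights(text)
    scores[name] = sum(a * b for a, b in zip(q, d))
    print(name, round(scores[name], 3))
```

D2 ranks highest (≈0.486), as on the slide: it is the only document containing silver, and contains it twice.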

Page 29: Information Retrieval with Open Source

Inverted index

Page 30: Information Retrieval with Open Source

Each term points to a postings list of (document, frequency) pairs:

term-1 → (dn,1) (d10,1)
term-2 → (dn,5) (dn,3)
term-3 → (d2,11) (d10,1)
term-4 → (dn,1) (d2,1)
term-5 → (dn,2) (d4,3)
term-n → (d6,1) (d7,3)
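A minimal in-memory version of this structure: each term maps to a postings list of (document id, term frequency) pairs. This is an illustrative sketch, not Lucene's on-disk index format:

```python
from collections import defaultdict

def build_index(docs):
    # Map each term to a postings list of (doc_id, term_frequency) pairs.
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for token in text.lower().split():
            counts[token] += 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))  # one posting per document
    return dict(index)

docs = {
    "d1": "shipment of gold damaged in a fire",
    "d2": "delivery of silver arrived in a silver truck",
}
index = build_index(docs)
print(index["silver"])  # [('d2', 2)]
print(index["gold"])    # [('d1', 1)]
```

At query time, only the postings lists of the query terms need to be read, which is what makes the inverted index fast: documents containing none of the query terms are never touched.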

Page 31: Information Retrieval with Open Source

Lucene

Page 32: Information Retrieval with Open Source

Analysis

Page 33: Information Retrieval with Open Source
Page 34: Information Retrieval with Open Source
Page 35: Information Retrieval with Open Source


when you order the filtering process. Consider an analyzer that removes stop words and also injects synonyms into the token stream: it would be more efficient to remove the stop words first so that the synonym injection filter would have fewer terms to consider (see section 4.6 for a detailed example).

4.3 Using the built-in analyzers

Lucene includes several built-in analyzers. The primary ones are shown in table 4.2. We'll leave discussion of the two language-specific analyzers, RussianAnalyzer and GermanAnalyzer, to section 4.8.2, and the special per-field analyzer wrapper, PerFieldAnalyzerWrapper, to section 4.4.

The built-in analyzers we discuss in this section—WhitespaceAnalyzer, SimpleAnalyzer, StopAnalyzer, and StandardAnalyzer—are designed to work with text in almost any Western (European-based) language. You can see the effect of each of these analyzers in the output in section 4.2.3. WhitespaceAnalyzer and SimpleAnalyzer are both trivial and we don't cover them in more detail here. We explore the StopAnalyzer and StandardAnalyzer in more depth because they have non-trivial effects.

4.3.1 StopAnalyzer

StopAnalyzer, beyond doing basic word splitting and lowercasing, also removes stop words. Embedded in StopAnalyzer is a list of common English stop words; this list is used unless otherwise specified:

public static final String[] ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",

Table 4.2 Primary analyzers available in Lucene

Analyzer Steps taken

WhitespaceAnalyzer Splits tokens at whitespace

SimpleAnalyzer Divides text at nonletter characters and lowercases

StopAnalyzer Divides text at nonletter characters, lowercases, and removes stop words

StandardAnalyzer Tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases; and removes stop words


Page 36: Information Retrieval with Open Source

Index

Page 37: Information Retrieval with Open Source

Index

• IndexWriter

• Directory

• Analyzer

• Document

• Field

Page 38: Information Retrieval with Open Source

store

Index options: store

Value        Description
:no          Don't store the field.
:yes         Store the field in its original format. Use this value if you want to highlight matches or print match excerpts à la Google search.
:compressed  Store the field in compressed format.

Ruby Day Kraków: Full Text Search with Ferret

Page 39: Information Retrieval with Open Source

index

Index options: index

Value         Description
:no           Do not make this field searchable.
:yes          Make this field searchable and tokenize its contents.
:untokenized  Make this field searchable but do not tokenize its contents. Use this value for fields you wish to sort by.
:omit_norms   Same as :yes except omit the norms file. The norms file can be omitted if you don't boost any fields and you don't need scoring based on field length.
:untokenized_omit_norms  Same as :untokenized except omit the norms file.


Page 40: Information Retrieval with Open Source

term_vector

Index options: term_vector

Value                    Description
:no                      Don't store term vectors.
:yes                     Store term vectors without storing positions or offsets.
:with_positions          Store term vectors with positions.
:with_offsets            Store term vectors with offsets.
:with_positions_offsets  Store term vectors with positions and offsets.


Page 41: Information Retrieval with Open Source
Page 42: Information Retrieval with Open Source

Search

Page 43: Information Retrieval with Open Source

Search

• IndexSearcher

• Term

• Query

• Hits

Page 44: Information Retrieval with Open Source

Query

Page 45: Information Retrieval with Open Source

Query

• API

• new TermQuery(new Term("name", "Tomek"));

• Lucene QueryParser

• queryParser.parse("name:Tomek");

Page 46: Information Retrieval with Open Source

TermQuery

name:Tomek

Page 47: Information Retrieval with Open Source

BooleanQuery

rambo OR ninja

+rambo +ninja -name:rocky

Page 48: Information Retrieval with Open Source

PhraseQuery

"ninja java" -name:rocky

Page 49: Information Retrieval with Open Source

SloppyPhraseQuery

"red-faced politicians"~3

Page 50: Information Retrieval with Open Source

RangeQuery

releaseDate:[2000 TO 2007]

Page 51: Information Retrieval with Open Source

WildcardQuery

sup?r, su*r, super*

Page 52: Information Retrieval with Open Source

FuzzyQuery

color~

colour, collor, colro

Page 53: Information Retrieval with Open Source

http://en.wikipedia.org/wiki/Levenshtein_distance

color → colour: distance 1

colour → coller: distance 2

Page 54: Information Retrieval with Open Source

Equation 1. Levenshtein Distance Score

This means that an exact match will have a score of 1.0, whereas terms with no corresponding letters will have a score of 0.0. Since FuzzyQuery has a limit to the number of matching terms it can use, the lowest scoring matches get discarded if the FuzzyQuery becomes full.
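The edit distance behind this score is easy to sketch. The similarity normalization shown below (1 minus distance divided by the shorter term's length) is an assumption modeled on how older Lucene/Ferret FuzzyQuery implementations derive a score from the edit distance, so treat it as illustrative:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_score(a, b):
    # Assumed normalization: 1 - distance / shorter length.
    return 1.0 - levenshtein(a, b) / min(len(a), len(b))

print(levenshtein("color", "colour"))   # 1
print(levenshtein("colour", "coller"))  # 2
print(fuzzy_score("color", "color"))    # 1.0
```

The two distance calls reproduce the examples on the previous slide: one insertion turns "color" into "colour", and two substitutions turn "colour" into "coller".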

Due to the way FuzzyQuery is implemented, it needs to enumerate every single term in its field's index to find all valid similar terms in the dictionary. This can take a long time if you have a large index. One way to prevent any performance problems is to set a minimum prefix length. This is done by setting the :min_prefix_length parameter when creating the FuzzyQuery. This parameter is set to 0 by default, hence the fact that it would need to enumerate every term in the index.

To minimize the expense of finding matching terms, we could set the minimum prefix length of the example query to 3. This would greatly reduce the number of terms that need to be enumerated, and "color" would still match "colour," although "cloor" would no longer match.

# FQL: 'content:color~' => no way to set :min_prefix_length in FQL
query = FuzzyQuery.new(:content, "color",
                       :max_terms => 1024,
                       :min_prefix_length => 3)

You can also set a cut-off score for matching terms by setting the :min_similarity parameter. This will not affect how many terms are enumerated, but it will affect how many terms are added to the internal MultiTermQuery, which can also help improve performance.

# FQL: 'content:color~0.8' => no way to set :min_prefix_length in FQL
query = FuzzyQuery.new(:content, "color",
                       :max_terms => 1024,
                       :min_similarity => 0.8,
                       :min_prefix_length => 3)

In some cases, you may want to change the default values for :min_prefix_length and :min_similarity, particularly for use in the Ferret QueryParser. Simply set the class variables in FuzzyQuery.

FuzzyQuery.default_min_similarity = 0.8
FuzzyQuery.default_prefix_length = 3


Page 55: Information Retrieval with Open Source

Boost

title:Spring^10

Page 56: Information Retrieval with Open Source
Page 57: Information Retrieval with Open Source