Information Retrieval and Web Search IR models: Boolean model Instructor: Rada Mihalcea Class web page: http://www.cs.unt.edu/~rada/CSCE5300

Upload: ira-pierce

Post on 21-Jan-2016


Information Retrieval and Web Search

IR models: Boolean model

Instructor: Rada Mihalcea

Class web page: http://www.cs.unt.edu/~rada/CSCE5300

Slide 2

Today’s topics

•Boolean retrieval

•Improvements / variations of the Boolean model
  – Extended Boolean model
  – Fuzzy information retrieval

Slide 3

IR Models

[Figure: taxonomy of IR models, organized by user task]

•User task: Retrieval (ad hoc, filtering)
  – Classic models: Boolean, vector, probabilistic
  – Set theoretic: fuzzy, extended Boolean
  – Algebraic: generalized vector, latent semantic indexing, neural networks
  – Probabilistic: inference network, belief network
  – Structured models: non-overlapping lists, proximal nodes

•User task: Browsing
  – Browsing models: flat, structure guided, hypertext

Slide 4

The Boolean Model

•Simple model based on set theory

•Queries are specified as Boolean expressions
  – precise semantics
  – neat formalism
  – $q = k_a \wedge (k_b \vee \neg k_c)$

•Terms are either present or absent, so $w_{ij} \in \{0,1\}$

•Consider
  – $q = k_a \wedge (k_b \vee \neg k_c)$
  – $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$
  – $\vec{q}_{cc} = (1,1,0)$ is a conjunctive component

•Each query can be transformed into disjunctive normal form (DNF)

Slide 5

The Boolean Model

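•$q = k_a \wedge (k_b \vee \neg k_c)$

•$\mathrm{sim}(q,d_j) = 1$ if the document satisfies the Boolean query, and $0$ otherwise
  – no in-between, only 0 or 1

[Figure: Venn diagram of the term sets Ka, Kb, and Kc, with the regions (1,0,0), (1,1,0), and (1,1,1) marked]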
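The binary matching rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the helper names `dnf_components` and `sim` are hypothetical, and documents are assumed to be given as binary vectors over the query terms (ka, kb, kc).

```python
from itertools import product

def dnf_components(query, num_terms):
    """Enumerate the binary term vectors (conjunctive components) that satisfy `query`."""
    return [bits for bits in product((1, 0), repeat=num_terms) if query(*bits)]

def sim(doc_vector, components):
    """Boolean similarity: 1 if the document vector equals some conjunctive component, else 0."""
    return 1 if tuple(doc_vector) in components else 0

# q = ka AND (kb OR NOT kc), written as a Python predicate over 0/1 values
query = lambda ka, kb, kc: bool(ka and (kb or not kc))

q_dnf = dnf_components(query, 3)
print(q_dnf)                  # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
print(sim((1, 1, 0), q_dnf))  # 1
print(sim((1, 0, 1), q_dnf))  # 0
```

Enumerating all 2^t vectors is only practical for a handful of terms; it is used here because it makes the DNF of the query explicit.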

Slide 6

Exercise

•D1 = “computer information retrieval”

•D2 = “computer retrieval”

•D3 = “information”

•D4 = “computer information”

•Q1 = “information ∧ retrieval”

•Q2 = “information ∧ ¬computer”
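One way to check answers to this exercise is a short script. It is a sketch under one assumption: both queries are read as conjunctions, that is, the operator between the query terms is taken to be AND. The helper name `matches` is hypothetical.

```python
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

def matches(predicate):
    """Return the IDs of documents whose term set satisfies the Boolean predicate."""
    return [doc_id for doc_id, text in docs.items() if predicate(set(text.split()))]

# Q1: information AND retrieval (AND is an assumption, see above)
q1 = matches(lambda terms: "information" in terms and "retrieval" in terms)
# Q2: information AND NOT computer
q2 = matches(lambda terms: "information" in terms and "computer" not in terms)

print(q1)  # ['D1']
print(q2)  # ['D3']
```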

Slide 7

Exercise

0 (no authors)

1 Swift

2 Shakespeare

3 Shakespeare Swift

4 Milton

5 Milton Swift

6 Milton Shakespeare

7 Milton Shakespeare Swift

8 Chaucer

9 Chaucer Swift

10 Chaucer Shakespeare

11 Chaucer Shakespeare Swift

12 Chaucer Milton

13 Chaucer Milton Swift

14 Chaucer Milton Shakespeare

15 Chaucer Milton Shakespeare Swift

((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
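The sixteen documents follow a binary numbering (Chaucer = 8, Milton = 4, Shakespeare = 2, Swift = 1), so the query can be checked mechanically. The sketch below assumes that reading of the numbering; `authors_in` and `satisfies` are hypothetical helper names.

```python
AUTHORS = ("chaucer", "milton", "shakespeare", "swift")
BITS = (8, 4, 2, 1)  # assumed bit weight of each author in the document number

def authors_in(doc_id):
    """Decode a document number 0..15 into the set of authors it mentions."""
    return {name for bit, name in zip(BITS, AUTHORS) if doc_id & bit}

def satisfies(doc):
    """((chaucer OR milton) AND NOT swift) OR ((NOT chaucer) AND (swift OR shakespeare))"""
    chaucer, milton = "chaucer" in doc, "milton" in doc
    shakespeare, swift = "shakespeare" in doc, "swift" in doc
    return ((chaucer or milton) and not swift) or (not chaucer and (swift or shakespeare))

retrieved = [d for d in range(16) if satisfies(authors_in(d))]
print(retrieved)  # [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14]
```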

Slide 8

Drawbacks of the Boolean Model

•Retrieval based on binary decision criteria with no notion of partial matching

•No ranking of the documents is provided (absence of a grading scale)

•The information need has to be translated into a Boolean expression, which most users find awkward

•The Boolean queries formulated by the users are most often too simplistic

•As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

Slide 9

•The Boolean model imposes a binary criterion for deciding relevance

•The question of how to extend the Boolean model to accommodate partial matching and ranking has attracted considerable attention in the past

•Two extensions of the Boolean model:
  – Fuzzy set model
  – Extended Boolean model

Slide 10

Fuzzy Set Model

•Queries and docs represented by sets of index terms: matching is approximate from the start

•This vagueness can be modeled using a fuzzy framework, as follows:
  – each term is associated with a fuzzy set
  – each doc has a degree of membership in this fuzzy set

•This interpretation provides the foundation for many models for IR based on fuzzy theory

•Here we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)

Slide 11

Fuzzy Set Theory

•Framework for representing classes whose boundaries are not well defined

•Key idea is to introduce the notion of a degree of membership associated with the elements of a set

•This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership

•Thus, membership is now a gradual notion, contrary to the notion enforced by classic Boolean logic

Slide 12

Fuzzy Set Theory

•Definition
  – A fuzzy subset $A$ of $U$ is characterized by a membership function $\mu_A : U \to [0,1]$, which associates with each element $u$ of $U$ a number $\mu_A(u)$ in the interval $[0,1]$

•Definition
  – Let $A$ and $B$ be two fuzzy subsets of $U$. Also, let $\neg A$ be the complement of $A$. Then,
    • $\mu_{\neg A}(u) = 1 - \mu_A(u)$
    • $\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$
    • $\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$
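The three operations can be tried out on small membership tables. In this sketch a fuzzy subset is assumed to be a Python dict mapping each element of U to its degree of membership; the sets `tall` and `fast` and their values are made up for illustration.

```python
def complement(mu_a):
    """Membership in NOT A: 1 - membership in A."""
    return {u: 1.0 - m for u, m in mu_a.items()}

def union(mu_a, mu_b):
    """Membership in A OR B: the larger of the two memberships."""
    return {u: max(mu_a[u], mu_b[u]) for u in mu_a}

def intersection(mu_a, mu_b):
    """Membership in A AND B: the smaller of the two memberships."""
    return {u: min(mu_a[u], mu_b[u]) for u in mu_a}

# Hypothetical fuzzy subsets of U = {ann, bob, cy}
tall = {"ann": 0.75, "bob": 0.5, "cy": 0.25}
fast = {"ann": 0.25, "bob": 0.5, "cy": 1.0}

print(complement(tall))          # {'ann': 0.25, 'bob': 0.5, 'cy': 0.75}
print(union(tall, fast))         # {'ann': 0.75, 'bob': 0.5, 'cy': 1.0}
print(intersection(tall, fast))  # {'ann': 0.25, 'bob': 0.5, 'cy': 0.25}
```

With memberships restricted to 0 and 1, these definitions reduce to the ordinary set complement, union, and intersection.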

Slide 13

Fuzzy Information Retrieval

•Fuzzy sets are modeled based on a thesaurus

•This thesaurus is built as follows:
  – Let $\vec{c}$ be a term-term correlation matrix
  – Let $c_{i,l}$ be a normalized correlation factor for $(k_i, k_l)$:

$$c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$$

  – $n_i$: number of docs which contain $k_i$
  – $n_l$: number of docs which contain $k_l$
  – $n_{i,l}$: number of docs which contain both $k_i$ and $k_l$

•We now have the notion of proximity among index terms.
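As a rough sketch, the correlation factor can be computed directly from document counts. The function name `correlation` is hypothetical, and the four short documents are reused from the earlier exercise purely as sample data.

```python
def correlation(docs):
    """Return {(ki, kl): n(i,l) / (ni + nl - n(i,l))} for every pair of terms in `docs`."""
    doc_sets = [set(d.split()) for d in docs]
    terms = sorted(set().union(*doc_sets))
    # For each term, the set of document indices that contain it
    postings = {k: {j for j, d in enumerate(doc_sets) if k in d} for k in terms}
    c = {}
    for ki in terms:
        for kl in terms:
            both = len(postings[ki] & postings[kl])
            c[ki, kl] = both / (len(postings[ki]) + len(postings[kl]) - both)
    return c

docs = [
    "computer information retrieval",
    "computer retrieval",
    "information",
    "computer information",
]
c = correlation(docs)
print(c["computer", "information"])   # 0.5
print(c["information", "retrieval"])  # 0.25
print(c["retrieval", "retrieval"])    # 1.0
```

The factor is the number of documents containing both terms divided by the number containing either, so it is 1 for a term paired with itself and 0 for terms that never co-occur.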

Slide 14

Fuzzy Information Retrieval

•The correlation factor $c_{i,l}$ can be used to define fuzzy set membership for a document $d_j$ as follows:

$$\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$$

where $\mu_{i,j}$ is the membership of doc $d_j$ in the fuzzy subset associated with $k_i$

•The above expression computes an algebraic sum over all terms in the doc $d_j$, implemented as the complement of a product of complements

•A doc $d_j$ belongs to the fuzzy set for $k_i$ if its own terms are associated with $k_i$
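The membership formula translates almost directly into code. This is a minimal sketch: `membership` is a hypothetical helper, and the correlation values in `c` are invented for illustration rather than taken from a real collection.

```python
from math import prod

def membership(ki, doc_terms, c):
    """Membership of a doc in the fuzzy set of ki: 1 - product over the doc's terms kl of (1 - c[ki, kl])."""
    return 1.0 - prod(1.0 - c[ki, kl] for kl in doc_terms)

# Hypothetical correlation factors between "retrieval" and three terms
c = {
    ("retrieval", "retrieval"): 1.0,
    ("retrieval", "computer"): 0.5,
    ("retrieval", "information"): 0.25,
}

print(membership("retrieval", ["information"], c))              # 0.25
print(membership("retrieval", ["computer", "information"], c))  # 0.625
print(membership("retrieval", ["computer", "retrieval"], c))    # 1.0
```

A single strongly correlated term drives the product toward 0 and the membership toward 1, and several weakly related terms accumulate, as in the second call.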

Slide 15

Fuzzy Information Retrieval

•$\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$

where $\mu_{i,j}$ is the membership of doc $d_j$ in the fuzzy subset associated with $k_i$

•If doc $d_j$ contains a term $k_l$ which is closely related to $k_i$, we have
  – $c_{i,l} \sim 1 \Rightarrow \mu_{i,j} \sim 1$
  – index $k_i$ is a good fuzzy index for the doc

Slide 16

Fuzzy IR: An Example

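•$q = k_a \wedge (k_b \vee \neg k_c)$

•$\vec{q}_{dnf} = (1,1,1) + (1,1,0) + (1,0,0) = \vec{cc}_1 + \vec{cc}_2 + \vec{cc}_3$

$$\mu_{q,j} = \mu_{cc_1+cc_2+cc_3,j} = 1 - (1 - \mu_{a,j}\,\mu_{b,j}\,\mu_{c,j}) \, (1 - \mu_{a,j}\,\mu_{b,j}\,(1-\mu_{c,j})) \, (1 - \mu_{a,j}\,(1-\mu_{b,j})\,(1-\mu_{c,j}))$$

[Figure: Venn diagram of Ka, Kb, and Kc, with the regions of the conjunctive components cc1, cc2, and cc3 marked]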
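The query membership above can be evaluated for any DNF with one small function. This is an illustrative sketch: `fuzzy_query_membership` is a hypothetical name, and a document is assumed to be given by its membership degrees in the fuzzy sets of ka, kb, and kc.

```python
from math import prod

def fuzzy_query_membership(mu, dnf):
    """mu: the doc's memberships for (ka, kb, kc); dnf: list of binary conjunctive components."""
    def component(cc):
        # A 1 bit contributes the membership, a 0 bit contributes its complement
        return prod(m if bit else 1.0 - m for m, bit in zip(mu, cc))
    # Algebraic sum over the conjunctive components
    return 1.0 - prod(1.0 - component(cc) for cc in dnf)

Q_DNF = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]  # q = ka AND (kb OR NOT kc)

print(fuzzy_query_membership((1.0, 1.0, 1.0), Q_DNF))  # 1.0
print(fuzzy_query_membership((0.0, 1.0, 1.0), Q_DNF))  # 0.0
print(fuzzy_query_membership((0.5, 0.5, 0.5), Q_DNF))  # 0.330078125
```

With crisp memberships (0 or 1) the result matches the plain Boolean model; intermediate memberships give a graded score that can be used for ranking.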

Slide 17

Fuzzy Information Retrieval

•Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory

•Experiments with standard test collections are not available

•Difficult to compare at this time

Slide 18

Extended Boolean Model

•Boolean model is simple and elegant.

•But, no provision for a ranking

•As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership

•Extend the Boolean model with the notions of partial matching and term weighting

•Combine characteristics of the Vector model with properties of Boolean algebra

Slide 19

The Idea

•The extended Boolean model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra

•Let
  – $q = k_x \wedge k_y$
  – Use weights associated with $k_x$ and $k_y$
  – In the Boolean model: $w_x = w_y = 1$; all other documents are irrelevant

Slide 20

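The Idea:

•$q_{and} = k_x \wedge k_y$; $w_{xj} = x$ and $w_{yj} = y$

$$\mathrm{sim}(q_{and},d_j) = 1 - \sqrt{\frac{(1-x)^2 + (1-y)^2}{2}}$$

•We want a document to be as close as possible to (1,1)

[Figure (AND): documents $d_j$ and $d_{j+1}$ plotted in the unit square from (0,0) to (1,1), with $x = w_{xj}$ on the $k_x$ axis and $y = w_{yj}$ on the $k_y$ axis]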

Slide 21

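The Idea:

•$q_{or} = k_x \vee k_y$; $w_{xj} = x$ and $w_{yj} = y$

$$\mathrm{sim}(q_{or},d_j) = \sqrt{\frac{x^2 + y^2}{2}}$$

•We want a document to be as far as possible from (0,0)

[Figure (OR): documents $d_j$ and $d_{j+1}$ plotted in the unit square from (0,0) to (1,1), with $x = w_{xj}$ on the $k_x$ axis and $y = w_{yj}$ on the $k_y$ axis]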
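The two-term AND and OR similarities are normalized Euclidean distances in the unit square, which a short sketch makes concrete. The function names `sim_and` and `sim_or` are chosen here for illustration; x and y are the document's weights for kx and ky, assumed to lie in [0, 1].

```python
from math import sqrt

def sim_and(x, y):
    """1 minus the normalized distance from (x, y) to the ideal AND point (1, 1)."""
    return 1.0 - sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)

def sim_or(x, y):
    """Normalized distance from (x, y) to the worst OR point (0, 0)."""
    return sqrt((x ** 2 + y ** 2) / 2)

# A document containing only kx (x = 1, y = 0) is no longer a plain miss or hit
print(round(sim_and(1, 0), 4))  # 0.2929
print(round(sim_or(1, 0), 4))   # 0.7071
print(sim_and(1, 1), sim_or(0, 0))  # 1.0 0.0
```

Dividing by 2 inside the square root keeps both scores in [0, 1], so a document matching only one term receives partial credit instead of the all-or-nothing Boolean result.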

Slide 22

Generalizing the Idea

•We can extend the previous model to consider Euclidean distances in a t-dimensional space

•This can be done using p-norms, which extend the notion of distance to include p-distances, where $1 \le p \le \infty$ is a new parameter

•A generalized disjunctive query is given by
  – $q_{or} = k_1 \vee^p k_2 \vee^p \ldots \vee^p k_t$

•A generalized conjunctive query is given by
  – $q_{and} = k_1 \wedge^p k_2 \wedge^p \ldots \wedge^p k_t$

Slide 23

Generalizing the Idea

– $\mathrm{sim}(q_{and},d_j) = 1 - \left( \frac{(1-x_1)^p + (1-x_2)^p + \ldots + (1-x_m)^p}{m} \right)^{1/p}$

– $\mathrm{sim}(q_{or},d_j) = \left( \frac{x_1^p + x_2^p + \ldots + x_m^p}{m} \right)^{1/p}$

– If $p = 1$ then (vector like)
  • $\mathrm{sim}(q_{or},d_j) = \mathrm{sim}(q_{and},d_j) = \frac{x_1 + \ldots + x_m}{m}$
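A small sketch of the p-norm similarities helps confirm the special case p = 1. The function names and the sample weights are assumptions made for illustration; `weights` holds the document's term weights x1..xm, each in [0, 1], and p is assumed finite.

```python
def sim_or(weights, p):
    """p-norm OR: ((x1^p + ... + xm^p) / m) ** (1/p)."""
    m = len(weights)
    return (sum(x ** p for x in weights) / m) ** (1 / p)

def sim_and(weights, p):
    """p-norm AND: 1 - (((1-x1)^p + ... + (1-xm)^p) / m) ** (1/p)."""
    m = len(weights)
    return 1.0 - (sum((1 - x) ** p for x in weights) / m) ** (1 / p)

weights = [0.2, 0.6, 1.0]  # hypothetical term weights of one document

# With p = 1 both operators collapse to the average weight
print(round(sim_or(weights, 1), 4), round(sim_and(weights, 1), 4))  # 0.6 0.6

# With p = 2 OR scores higher than AND for the same document
print(sim_or(weights, 2) > sim_and(weights, 2))  # True
```

Raising p makes OR behave more like a maximum and AND more like a minimum, so p controls how strictly the Boolean operators are interpreted.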

Slide 24

Conclusions

•Model is quite powerful

•Properties are interesting and might be useful

•Computation is somewhat complex

•However, distributivity does not hold for the ranking computation:
  – $q_1 = (k_1 \vee k_2) \wedge k_3$
  – $q_2 = (k_1 \wedge k_3) \vee (k_2 \wedge k_3)$
  – $\mathrm{sim}(q_1,d_j) \ne \mathrm{sim}(q_2,d_j)$
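The failure of distributivity can be observed numerically. The sketch below assumes p = 2, evaluates nested queries by applying the p-norm operators recursively (inner sub-query first), and uses made-up weights; `p_or` and `p_and` are hypothetical helper names.

```python
def p_or(values, p=2):
    """p-norm OR over a list of similarity values in [0, 1]."""
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)

def p_and(values, p=2):
    """p-norm AND over a list of similarity values in [0, 1]."""
    return 1.0 - (sum((1 - v) ** p for v in values) / len(values)) ** (1 / p)

x1, x2, x3 = 0.9, 0.2, 0.5  # hypothetical weights of k1, k2, k3 in one document

# q1 = (k1 OR k2) AND k3
sim_q1 = p_and([p_or([x1, x2]), x3])
# q2 = (k1 AND k3) OR (k2 AND k3)
sim_q2 = p_or([p_and([x1, x3]), p_and([x2, x3])])

print(round(sim_q1, 3), round(sim_q2, 3))  # 0.569 0.51
```

The two queries are logically equivalent in Boolean algebra, yet they receive different scores for the same document, so the way a query is written can affect the ranking.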