modern information retrieval chapter 1: introduction

Modern Information Retrieval

Chapter 1: Introduction

Ricardo Baeza-Yates, Berthier Ribeiro-Neto

Motivation

Example of a user information need

Topic: NCAA college tennis team

Description: Find all the pages (documents) containing information on college tennis teams which (1) are maintained by a university in the USA and (2) participate in the NCAA tennis tournament.

Narrative: To be relevant, the page must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.

IR Research

Information retrieval vs Data retrieval

Research
• information search
• information filtering (routing)
• document classification and categorization
• user interfaces and data visualization
• cross-language retrieval

IR History

1990, WWW

The User Task

Retrieval (Searching): the classic information search process, where clear objectives are defined

Browsing: a process where one’s main objectives are not clearly defined and might change during the interaction with the system

Logical View of the Documents

Text operations reduce the complexity of the document representation, from the full text to a set of index terms.

Steps
1. Stopword removal
2. Stemming
3. Noun groups
4. ...
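The first two steps can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular system: the stopword list, the suffix rules, and the function names `naive_stem` and `index_terms` are all assumptions made up for the example, and the suffix stripping is a crude stand-in for a real stemming algorithm. Noun-group detection is left out.

```python
import re

# Hypothetical stopword list and suffix rules, chosen only for illustration.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "are", "to"}
SUFFIXES = ("ing", "es", "s")


def naive_stem(word):
    """Strip at most one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Reduce full text to a sorted list of distinct index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())       # tokenize
    kept = [t for t in tokens if t not in STOPWORDS]   # step 1: stopwords
    return sorted({naive_stem(t) for t in kept})       # step 2: stemming


terms = index_terms("The user is searching the documents and indexes")
print(terms)  # ['document', 'index', 'search', 'user']
```

Eight words of full text shrink to four index terms, which is the reduction in representation complexity that the text operations aim for.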

Past, Present, and Future

Early development: index

Library: author name, title, subject headings, keywords

The Web and digital libraries: hyperlinks

Conventional Text-Retrieval Systems

Automatic Text Processing, G. Salton, Addison-Wesley, 1989 (Chapter 9)

Data Retrieval

A specified set of attributes is used to characterize each record.

EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)

Exact match between the attributes used in query formulations and those attached to the document.

SELECT BDATE, ADDR
FROM EMPLOYEE
WHERE NAME = 'John Smith';
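To see the exact-match behaviour in action, the query can be run against an in-memory SQLite database using Python's standard `sqlite3` module. The two rows below are invented for illustration; only the schema and the query come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE (NAME TEXT, SSN TEXT, BDATE TEXT, ADDR TEXT, "
    "SEX TEXT, SALARY REAL, DNO INTEGER)"
)

# Made-up rows for illustration only.
rows = [
    ("John Smith", "000-00-0001", "1965-01-09", "1 Example St", "M", 30000.0, 5),
    ("Jon Smith", "000-00-0002", "1970-03-02", "2 Example St", "M", 28000.0, 4),
]
conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Exact match on NAME: the near-miss 'Jon Smith' is not returned.
result = conn.execute(
    "SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", ("John Smith",)
).fetchall()
print(result)  # [('1965-01-09', '1 Example St')]
```

A record either satisfies the condition or it does not; there is no notion of a partially matching record, which is the key contrast with text retrieval.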

Text-Retrieval Systems

Content identifiers (keywords, index terms, descriptors) characterize the stored texts.

Degrees of coincidence between the sets of identifiers attached to queries and documents

content analysis, query formulation

Possible Representation

Document representation (text operation)
• unweighted index terms (term vectors)
• weighted index terms
• …

Query (query operation)
• unweighted or weighted index terms
• Boolean combinations (or, and, not)
• …

Search operation must be effective (Indexing)

File Structures

Main requirements
• fast access for various kinds of searches
• large number of indices

Alternatives
• Inverted files
• Signature files
• PAT trees

Inverted Files

File is represented as an array of indexed documents.

         Term 1   Term 2   Term 3   Term 4
Doc 1      1        1        0        1
Doc 2      0        1        1        1
Doc 3      1        0        1        1
Doc 4      0        0        1        1

Inverted-file process

The document-term array is inverted (transposed).

         Doc 1   Doc 2   Doc 3   Doc 4
Term 1     1       0       1       0
Term 2     1       1       0       0
Term 3     0       1       1       1
Term 4     1       1       1       1
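The transposition can be sketched in Python as follows. The data is the document-term array from the slides; the helper names `invert` and `postings` are hypothetical, and a production index would store posting lists directly rather than a dense 0/1 array.

```python
# Document-term array: one row per document, one column per term.
doc_term = {
    "Doc 1": [1, 1, 0, 1],
    "Doc 2": [0, 1, 1, 1],
    "Doc 3": [1, 0, 1, 1],
    "Doc 4": [0, 0, 1, 1],
}
terms = ["Term 1", "Term 2", "Term 3", "Term 4"]


def invert(doc_term, terms):
    """Transpose the array so that each term maps to one row of 0/1 flags."""
    columns = zip(*doc_term.values())
    return {term: list(col) for term, col in zip(terms, columns)}


def postings(term_doc, doc_ids, term):
    """List the documents whose flag is set for the given term."""
    return [doc for doc, bit in zip(doc_ids, term_doc[term]) if bit]


term_doc = invert(doc_term, terms)
doc_ids = list(doc_term)
print(term_doc["Term 2"])                     # [1, 1, 0, 0]
print(postings(term_doc, doc_ids, "Term 3"))  # ['Doc 2', 'Doc 3', 'Doc 4']
```

After inversion, answering "which documents contain Term 3?" is a single row lookup instead of a scan over every document.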

Inverted-file process (Continued)

Take two or more rows of an inverted term-document array, and produce a single combined list of document identifiers.

Ex: Query= (term2 and term3)

         D1  D2  D3  D4
term2     1   1   0   0
term3     0   1   1   1
--------------------------
AND       0   1   0   0    <-- D2
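Combining rows for a Boolean query amounts to a column-wise AND (or OR) over the selected term rows. A minimal sketch, using the two rows from the example; the function names `and_query` and `or_query` are assumptions made up for the illustration.

```python
term_doc = {
    "term2": [1, 1, 0, 0],
    "term3": [0, 1, 1, 1],
}
doc_ids = ["D1", "D2", "D3", "D4"]


def and_query(term_doc, doc_ids, *terms):
    """Documents in which every listed term occurs."""
    rows = [term_doc[t] for t in terms]
    return [doc for doc, bits in zip(doc_ids, zip(*rows)) if all(bits)]


def or_query(term_doc, doc_ids, *terms):
    """Documents in which at least one listed term occurs."""
    rows = [term_doc[t] for t in terms]
    return [doc for doc, bits in zip(doc_ids, zip(*rows)) if any(bits)]


print(and_query(term_doc, doc_ids, "term2", "term3"))  # ['D2']
print(or_query(term_doc, doc_ids, "term2", "term3"))   # ['D1', 'D2', 'D3', 'D4']
```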

List-merging for two ordered lists

The inverted-index operations to obtain answers are based on a list-merging process.

Example
T1: {D1, D3}
T2: {D1, D2}
Merged(T1, T2): {D1, D1, D2, D3}
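One way to sketch the merge in Python is with the standard library's `heapq.merge`, which combines already-sorted inputs and keeps duplicates. The interpretation of the merged list is an assumption of this sketch: if each posting list is sorted and free of duplicates, an identifier that appears once per input list satisfies AND, and any identifier that appears at all satisfies OR. The helper names are hypothetical.

```python
from heapq import merge
from itertools import groupby


def merge_postings(*lists):
    """Merge sorted posting lists into one sorted list, keeping duplicates."""
    return list(merge(*lists))


def and_from_merged(merged, n_lists):
    """Identifiers that occur in all n_lists inputs (one copy from each)."""
    return [doc for doc, grp in groupby(merged) if len(list(grp)) == n_lists]


def or_from_merged(merged):
    """Identifiers that occur in at least one input."""
    return [doc for doc, _ in groupby(merged)]


t1 = ["D1", "D3"]
t2 = ["D1", "D2"]
merged = merge_postings(t1, t2)
print(merged)                      # ['D1', 'D1', 'D2', 'D3']
print(and_from_merged(merged, 2))  # ['D1']
print(or_from_merged(merged))      # ['D1', 'D2', 'D3']
```

Because both inputs are ordered, the merge needs only one pass over each list.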

Extensions of Inverted Index Operations(Distance Constraints)

Distance Constraints

(A within sentence B): terms A and B must co-occur in a common sentence

(A adjacent B): terms A and B must occur adjacently in the text

Implementation

Include term locations in the inverted indexes
information: {P345, P348, P350, …}
retrieval: {P123, P128, P345, …}

Include sentence locations in the indexes
information: {P345, 25; P345, 37; P348, 10; P350, 8; …}
retrieval: {P123, 5; P128, 25; P345, 37; P345, 40; …}

Include paragraph numbers in the indexes, sentence numbers within paragraphs, and word numbers within sentences
information: {P345, 2, 3, 5; …}
retrieval: {P345, 2, 3, 6; …}

Query examples
(information adjacent retrieval)
(information within five words retrieval)

Cost: the size of indexes
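A distance-constrained query can be sketched over a positional index, here simplified to one flat word offset per occurrence rather than the paragraph/sentence/word triples above. The offsets are invented for illustration, the function names `within` and `adjacent` are hypothetical, and two interpretations are assumed: "within n words" means the offsets differ by at most n, and "adjacent" means they differ by exactly one in either order.

```python
# Hypothetical positional index: term -> document -> sorted word offsets.
index = {
    "information": {"P345": [12, 40], "P348": [7]},
    "retrieval": {"P123": [5], "P345": [13, 90], "P348": [20]},
}


def within(index, a, b, max_gap):
    """Documents where some occurrence of a is at most max_gap words from b."""
    postings_a = index.get(a, {})
    postings_b = index.get(b, {})
    hits = []
    # Only documents containing both terms can qualify.
    for doc in sorted(postings_a.keys() & postings_b.keys()):
        if any(
            abs(pa - pb) <= max_gap
            for pa in postings_a[doc]
            for pb in postings_b[doc]
        ):
            hits.append(doc)
    return hits


def adjacent(index, a, b):
    """Documents where a and b occur next to each other, in either order."""
    return within(index, a, b, 1)


print(adjacent(index, "information", "retrieval"))    # ['P345']
print(within(index, "information", "retrieval", 5))   # ['P345']
print(within(index, "information", "retrieval", 13))  # ['P345', 'P348']
```

The extra offsets stored per occurrence are what make these queries answerable, and they are also the source of the index-size cost.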

Retrieval models

Classic Models: Boolean, Vector, Probabilistic

Set Theoretic: Fuzzy, Extended Boolean

Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks

Probabilistic: Inference Network, Belief Network

Classic IR Model

Basic concepts : Each document is described by a set of representative keywords called index terms.

Assign numerical weights to index terms to distinguish how relevant each term is.

Boolean model

Binary decision criterion

Data retrieval model

Advantage
• Clean formalism, simplicity

Disadvantage
• It is not simple to translate an information need into a Boolean expression.
• Exact matching may lead to retrieval of too few or too many documents.

Vector model

Assign non-binary weights to index terms in queries and in documents. => TFxIDF

Compute the similarity between documents and query. => Sim(Dj, Q)

More precise than the Boolean model, since documents that match the query only partially can still be ranked.
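A small sketch of both steps follows. It assumes one common TF×IDF variant (raw term frequency times log(N/df)) and cosine similarity for Sim(Dj, Q); the slides do not fix either choice, and the three toy documents and the names `tfidf_vectors` and `cosine` are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps doc id -> list of terms. Returns (weight vectors, idf)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for terms in docs.values() for term in set(terms))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(terms).items()}
        for doc_id, terms in docs.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy collection, for illustration only.
docs = {
    "D1": ["information", "retrieval", "retrieval"],
    "D2": ["data", "retrieval"],
    "D3": ["information", "filtering"],
}
vectors, idf = tfidf_vectors(docs)
query = {term: idf.get(term, 0.0) for term in ["information", "retrieval"]}
ranking = sorted(docs, key=lambda d: cosine(vectors[d], query), reverse=True)
print(ranking[0])  # D1
```

D2 and D3 each share only one term with the query, yet both receive a nonzero score, whereas the Boolean query "information and retrieval" would reject them outright.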

Term Weights

Di = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6}

Issues
• How to generate the term weights?
• How to apply the term weights?

One way to apply them:

• Sum the weights of all document terms that match the given query.

• Rank the output documents in descending order of the summed weight.
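These two steps can be sketched as follows. D1 uses the example weights; D2 and the query are assumptions added for illustration, as are the function names `score` and `rank`.

```python
docs = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},  # hypothetical second document
}


def score(doc_weights, query_terms):
    """Sum the weights of the document terms that also occur in the query."""
    return sum(w for term, w in doc_weights.items() if term in query_terms)


def rank(docs, query_terms):
    """Return (doc id, score) pairs, highest score first."""
    scored = [(doc_id, score(w, query_terms)) for doc_id, w in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank(docs, {"T2", "T3"})
print([doc_id for doc_id, _ in ranked])  # ['D1', 'D2']
```

For the query {T2, T3}, D1 scores 0.5 + 0.6 and D2 scores 0.2 + 0.1, so D1 is listed first.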

Boolean Query with Term Weights

Transform a Boolean expression into disjunctive normal form.

T1 and (T2 or T3) = (T1 and T2) or (T1 and T3)

For each conjunct, compute the minimum term weight of any document term in that conjunct.

The document weight is the maximum of all the conjunct weights.

Boolean Query with Term Weights

Example: Q = (T1 and T2) or T3

Document vectors                  Conjunct weights        Query weight
                                  (T1 and T2)    (T3)     (T1 and T2) or T3
D1 = (T1,0.2; T2,0.5; T3,0.6)     0.2            0.6      0.6
D2 = (T1,0.7; T2,0.2; T3,0.1)     0.2            0.1      0.2

D1 is preferred.
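The min/max rule can be checked with a short Python sketch. The query is assumed to be already in disjunctive normal form, written as a list of conjuncts; the function names `conjunct_weight` and `query_weight` are hypothetical, and a term missing from a document is assumed to have weight 0.

```python
def conjunct_weight(doc, conjunct):
    """Minimum weight among the document terms named in one conjunct."""
    return min(doc.get(term, 0.0) for term in conjunct)


def query_weight(doc, dnf):
    """Maximum of the conjunct weights over all conjuncts of the query."""
    return max(conjunct_weight(doc, conjunct) for conjunct in dnf)


# Q = (T1 and T2) or T3, as a list of conjuncts.
dnf = [("T1", "T2"), ("T3",)]
docs = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
}

weights = {doc_id: query_weight(doc, dnf) for doc_id, doc in docs.items()}
best = max(weights, key=weights.get)
print(weights)  # {'D1': 0.6, 'D2': 0.2}
print(best)     # D1
```

The results match the table: D1 gets max(min(0.2, 0.5), 0.6) = 0.6 and D2 gets max(min(0.7, 0.2), 0.1) = 0.2.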

Summary

Conventional IR systems
• Evaluation
• Text operations (term selection)
• Query operations (pattern matching, relevance feedback)
• Indexing (file structure)
• Modeling

Resources

Journals
• Journal of the American Society for Information Science
• ACM Transactions on Information Systems
• Information Processing and Management
• Information Systems (Elsevier)
• Knowledge and Information Systems (Springer)

Conferences
• ACM SIGIR, DL, CIKM, CHI, etc.
• Text Retrieval Conference (TREC)
