Information Retrieval (IR)


Page 1: Information Retrieval ( IR )

E.G.M Petrakis Multimedia Information Retrieval

Information Retrieval (IR)
- Deals with the representation, storage and retrieval of unstructured data
- Topics of interest: systems, languages, retrieval, user interfaces, data visualization, distributed data sets
- Classical IR deals mainly with text
- The evolution of multimedia databases and of the web has given new interest to IR

Page 2: Information Retrieval ( IR )

Multimedia IR
- A multimedia information system that can store and retrieve
  - attributes
  - text
  - 2D grey-scale and color images
  - 1D time series
  - digitized voice or music
  - video

Page 3: Information Retrieval ( IR )

Applications
- Financial, marketing: stock prices, sales, etc.
  - find companies whose stock prices move similarly
- Scientific databases: sensor data
  - weather, geological, environmental data
- Office automation, electronic encyclopedias, electronic books
- Medical databases: X-rays, CT, MRI scans
- Criminal investigation: suspects, fingerprints
- Personal archives: text and color images

Page 4: Information Retrieval ( IR )

Queries
- Formulation of the user’s information need
- Free-text query
- By example document (e.g., text, image)
  - e.g., in a collection of color photos, find those showing a tree close to a house
- Retrieval is based on understanding the content of documents and of their components

Page 5: Information Retrieval ( IR )

Goals of Retrieval
- Accuracy: retrieve the documents that the user expects in the answer
  - with as few incorrect answers as possible
  - all relevant answers are retrieved
- Speed: retrieval has to be fast
  - the system responds in real time

Page 6: Information Retrieval ( IR )

Accuracy of Retrieval
- Depends on query criteria, query complexity and specificity
  - attributes / content criteria?
- Depends on what is matched with the query
  - how document content is represented
  - typically feature vectors
- Depends also on the matching function
  - e.g., Euclidean distance

Page 7: Information Retrieval ( IR )

Speed of Retrieval
- The query is compared (matched) with all stored documents
- The definition of similarity criteria (a similarity or distance function between documents) is an important issue
- Matching has to be fast: document matching has to be computationally efficient, and sequential searching must be avoided
- Indexing: search only the documents that are likely to match the query

Page 8: Information Retrieval ( IR )

[Figure: diagram with the labels query, index, database, descriptions (Doc1-FeatureVector, Doc2-FeatureVector, ..., DocN-FeatureVector) and documents (Doc1, Doc2, ..., DocN)]

Page 9: Information Retrieval ( IR )

Main Idea
- Descriptions are extracted and stored
  - same or separate storage with documents
- Queries address the stored descriptions rather than the documents themselves
- Images & video: the majority of stored data
- Solution: retrieval by text
  - text retrieval is a well-researched area
  - most systems work with text and attributes

Page 10: Information Retrieval ( IR )

Problem
- Text descriptions for images and video contained in documents are not available
  - e.g., the text in a web site is not always descriptive of every particular image contained in the web site
- Two approaches
  - human annotations
  - feature extraction

Page 11: Information Retrieval ( IR )

Human Annotations
- Humans insert attributes and captions for images and videos
  - inconsistent or subjective over time and across users (different users do not give the same descriptions)
  - retrievals will fail if queries are formulated using different keywords or descriptions
  - time-consuming process, expensive or impossible for large databases

Page 12: Information Retrieval ( IR )

Feature Extraction
- Features are extracted from audio, image and video
  - cheaper and replicable approach
  - consistent descriptions, but inexact
  - low-level features (patterns, colors, etc.)
  - difficult to extract meaning
  - different techniques for different data

Page 13: Information Retrieval ( IR )

Architecture of an IR System for Images

[Figure: architecture of an IR system for images, from Furht et al. 96]

Page 14: Information Retrieval ( IR )

Multimedia IR in Practice
- Relies on text descriptions
- Non-text IR
  - there is nothing similar to text-based retrieval systems
  - automatic IR is sometimes impossible
  - requires human intervention at some level
  - significant progress has been made
  - domain-specific interpretations

Page 15: Information Retrieval ( IR )

Ranking of IR Methods
- Ranking in terms of complexity and accuracy
  - attributes, text, audio, image, video
- Combined IR based on two or more data types for more accurate retrievals
- Complex data types (e.g., video) are rich sources of information

[Figure: the ranking above, with axes labeled accuracy and complexity]

Page 16: Information Retrieval ( IR )

Similarity Searching
- IR is not an exact process
  - two documents are never identical!
  - they can be “similar”
  - searching must be “approximate”
- The effectiveness of IR depends on the
  - types and correctness of descriptions used
  - types of queries allowed
  - user uncertainty as to what he is looking for
  - efficiency of search techniques

Page 17: Information Retrieval ( IR )

Approximate IR
- A query is specified and all documents up to a pre-specified degree of similarity are retrieved and presented to the user ordered by similarity
- Two common types of similarity queries:
  - range queries: retrieve all documents up to a distance “threshold” T
  - nearest-neighbor queries: retrieve the k best matches
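
The two query types can be sketched with a plain sequential scan over feature vectors. This is only an illustration: the Euclidean distance is the matching function mentioned earlier in the deck, while the function names and the toy collection `docs` are assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(query, docs, threshold):
    """All (doc_id, distance) pairs with distance <= threshold, nearest first."""
    hits = [(doc_id, euclidean(query, vec)) for doc_id, vec in docs.items()]
    return sorted((h for h in hits if h[1] <= threshold), key=lambda h: h[1])

def knn_query(query, docs, k):
    """The k best matches as (doc_id, distance) pairs, nearest first."""
    hits = [(doc_id, euclidean(query, vec)) for doc_id, vec in docs.items()]
    return sorted(hits, key=lambda h: h[1])[:k]

# Hypothetical collection: document id -> 2-D feature vector.
docs = {"doc1": (0.0, 0.0), "doc2": (3.0, 4.0), "doc3": (1.0, 0.0)}
print(range_query((0.0, 0.0), docs, threshold=1.0))
print(knn_query((0.0, 0.0), docs, k=2))
```

Both functions compare the query with every stored vector, which is exactly the sequential search that indexing is meant to avoid.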

Page 18: Information Retrieval ( IR )

Distance - Similarity
- Decide whether two documents are similar
  - distance: the lower the distance, the more similar the documents are
  - similarity: the higher the similarity, the more similar the documents are
- Key issue for successful retrieval: the more accurate the descriptions are, the more accurate the retrieval is

Page 19: Information Retrieval ( IR )

Retrieval Quality Criteria
- Retrieval with as few errors as possible
- Two types of errors:
  - false dismissals or misses: qualifying but non-retrieved documents
  - false positives or false drops: retrieved but non-qualifying documents
- A good method minimizes both
- Ranking quality: retrieve qualifying documents before non-qualifying ones

Page 20: Information Retrieval ( IR )

Evaluation of Retrieval
- “Given a collection of documents, a set of queries and human expert’s responses to the above queries, the ideal system will retrieve exactly what the human dictated”
- The deviations from the above are measured

\[ \text{recall} = \frac{\text{number of retrieved relevant in answer}}{\text{total number of relevant in the collection}} \]

\[ \text{precision} = \frac{\text{number of retrieved relevant in answer}}{\text{number of retrieved}} \]
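
A small sketch of the two measures, assuming the answer and the expert's relevance judgments are given as collections of document ids. The function name and the convention of returning 0.0 for an empty denominator are choices made for this example, not part of the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids returned by the system; relevant: ids judged relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved documents that are relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 3 relevant in the collection, 2 in common.
print(precision_recall([1, 2, 3, 4], [2, 4, 5]))
```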

Page 21: Information Retrieval ( IR )

Precision-Recall Diagram
- Measure precision-recall for 1, 2, ..., N answers
- High precision means few false alarms
- High recall means few false dismissals

[Figure: precision-recall plot, both axes running from 0 to 1, with “the ideal method” marked]

Page 22: Information Retrieval ( IR )

Harmonic Mean
- A single measure combining precision and recall

\[ F = \frac{2}{\frac{1}{r} + \frac{1}{p}} \]

- r: recall, p: precision
- F takes values in [0,1]
- F → 1 as more retrieved documents are relevant
- F → 0 as few retrieved documents are relevant
- F is high when both precision and recall are high
- F expresses a compromise between precision and recall
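
As a quick numeric check of the harmonic mean, here is a minimal sketch. Returning 0.0 when either recall or precision is zero is an assumption added to avoid division by zero; it matches the limit F → 0.

```python
def f_measure(r, p):
    """Harmonic mean of recall r and precision p, both in [0, 1]."""
    if r == 0 or p == 0:
        return 0.0  # assumed convention: the harmonic mean tends to 0 here
    return 2 / (1 / r + 1 / p)

print(f_measure(0.5, 1.0))  # lower than the arithmetic mean 0.75
```

The harmonic mean is pulled toward the smaller of the two values, which is why F is high only when both precision and recall are high.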

Page 23: Information Retrieval ( IR )

Ranking Quality
- The higher the R_norm, the better the ability of a method to retrieve correct answers before incorrect ones
- A human expert evaluates the answer set
- Then take answers in pairs: (relevant, irrelevant)
  - S+: pairs ranked correctly (the relevant entry has been retrieved before the irrelevant one)
  - S-: pairs ranked incorrectly
  - S+max: total ranked pairs

\[ R_{norm} = \begin{cases} \frac{1}{2}\left(1 + \frac{S^{+} - S^{-}}{S^{+}_{max}}\right) & \text{if } S^{+}_{max} > 0 \\ 1 & \text{otherwise} \end{cases} \]
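
A sketch of the pair counting behind R_norm. It assumes a strict ranking without ties, so that the total number of ranked pairs equals S+ plus S-; the function name and arguments are hypothetical.

```python
def r_norm(ranking, relevant):
    """R_norm of a ranked answer.

    ranking: document ids in retrieval order (no ties assumed).
    relevant: set of ids the human expert judged relevant.
    """
    s_plus = s_minus = 0
    for i, first in enumerate(ranking):
        for second in ranking[i + 1:]:
            if first in relevant and second not in relevant:
                s_plus += 1   # relevant retrieved before irrelevant
            elif first not in relevant and second in relevant:
                s_minus += 1  # irrelevant retrieved before relevant
    s_max = s_plus + s_minus  # all (relevant, irrelevant) pairs
    if s_max == 0:
        return 1.0
    return 0.5 * (1 + (s_plus - s_minus) / s_max)

print(r_norm(["a", "x", "b", "y"], {"a", "b"}))
```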

Page 24: Information Retrieval ( IR )

Searching Method
- Sequential scanning: the query is matched with all stored documents
  - slow if the database is large or if matching is slow
- The documents must be indexed
  - hashing, inverted files, B-trees, R-trees
  - different data types are indexed and searched separately

Page 25: Information Retrieval ( IR )

Goals of IR
- Summarizing, there are two general goals common to all IR systems
  - effectiveness: IR must be accurate (retrieves what the user expects to see in the answer)
  - efficiency: IR must be fast (faster than sequential scanning)

Page 26: Information Retrieval ( IR )

Approximate IR
- A query is specified and all documents up to a pre-specified degree of similarity are retrieved and presented to the user ordered by similarity
- Two common types of similarity queries:
  - range queries: retrieve all documents up to a distance “threshold” T
  - nearest-neighbor queries: retrieve the k best matches

Page 27: Information Retrieval ( IR )

Query Formulation
- Must be flexible and convenient
- SQL queries: constraints on all attributes and data types
  - may become very complex
- Queries by example: e.g., by providing an example document or image
- Browsing: display headers, summaries, miniatures, etc., or for refining the retrieved results

Page 28: Information Retrieval ( IR )

Access Methods for Text
- Access methods for text are interesting for at least 3 reasons
  - multimedia documents contain text, e.g., images often have captions
  - text retrieval has several applications in itself (library automation, web search, etc.)
  - text retrieval research has led to useful ideas like the vector space model, information filtering, relevance feedback, etc.

Page 29: Information Retrieval ( IR )

Text Queries
- Single or multiple keyword queries
  - context queries: phrases, word proximity
  - boolean: keywords with and, or, not
- Natural language: free text queries
- Structured search also takes text structure into account and can be:
  - flat or hierarchical, for searching in titles, paragraphs, sections, chapters
  - hypertext: combines content and connectivity

Page 30: Information Retrieval ( IR )

Keyword Matching
- The whole collection is searched
  - no preprocessing
  - no space overhead
  - updates are easy
- The user specifies a string (regular expression) and the text is parsed using a finite state automaton
  - KMP, BMH algorithms

Page 31: Information Retrieval ( IR )

Error Tolerant Methods
- Methods that can tolerate errors
- Scan the text one character at a time, keeping track of matched characters, and retrieve strings within a desired editing distance from the query
  - Wu, Manber 92; Baeza-Yates, Gonnet 92
- Extension: regular expressions built up from strings and operators: “pro (blem | tein) (s | ε) | (0 | 1 | 2)”

Page 32: Information Retrieval ( IR )

Error Counting Methods
- Retrieve similar words or phrases, e.g., misspelled words or words with different pronunciation
  - editing distance: a numerical estimate of the similarity between 2 strings
  - phonetic coding: search words with similar pronunciation (Soundex, Phonix)
  - N-grams: count common N-length substrings
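
The N-gram idea can be illustrated with a short sketch that counts the N-length substrings two words share. The function names and the default N = 2 are assumptions for the example; repeated N-grams are counted with multiplicity.

```python
from collections import Counter

def ngrams(word, n=2):
    """Multiset of the N-length substrings of a word."""
    return Counter(word[i:i + n] for i in range(len(word) - n + 1))

def common_ngrams(a, b, n=2):
    """Number of N-grams shared by a and b (multiset intersection)."""
    return sum((ngrams(a, n) & ngrams(b, n)).values())

# A misspelling still shares most bigrams with the correct word.
print(common_ngrams("retrieval", "retreival"))
```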

Page 33: Information Retrieval ( IR )

Editing Distance
- Minimum number of edit operations that are needed to transform the first string into the second
  - edit operations: insert, delete, substitute and (sometimes) transposition of characters
- d(s_i, t_j): distance between characters
  - usually d(s_i, t_j) = 1 if s_i ≠ t_j and 0 otherwise
  - edit(cordis, codis) = 1: r is deleted
  - edit(cordis, codris) = 1: transposition of rd

Page 34: Information Retrieval ( IR )

Edit distance matrix for the strings abbcdd (columns, index i) and aaabcccd (rows, index j):

    i        0  1  2  3  4  5  6
    j           a  b  b  c  d  d
    0        0  1  2  3  4  5  6
    1   a    1  0  1  2  3  4  5
    2   a    2  1  1  2  3  4  5
    3   a    3  2  2  2  3  4  5
    4   b    4  3  2  2  3  4  5
    5   c    5  4  3  3  2  3  4
    6   c    6  5  4  4  3  3  4
    7   c    7  6  5  5  4  4  4
    8   d    8  7  6  6  5  4  4

- initialization cost: row j = 0 and column i = 0
- total cost: the bottom-right entry (4)

Page 35: Information Retrieval ( IR )

Damerau-Levenshtein Algorithm
- Basic recurrence relation (DP algorithm)

\[
\begin{aligned}
\mathrm{edit}(0,0) &= 0, \quad \mathrm{edit}(i,0) = i, \quad \mathrm{edit}(0,j) = j \\
\mathrm{edit}(i,j) &= \min\{\, \mathrm{edit}(i-1,j) + 1,\; \mathrm{edit}(i,j-1) + 1,\; \mathrm{edit}(i-1,j-1) + d(s_i,t_j), \\
&\qquad\quad \mathrm{edit}(i-2,j-2) + d(s_i,t_{j-1}) + d(s_{i-1},t_j) + 1 \,\}
\end{aligned}
\]
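
The recurrence on this slide can be transcribed almost term by term. The sketch below assumes the unit character distance (d = 1 for different characters, 0 otherwise) and applies the transposition term only when i and j are both at least 2; the function name is chosen for the example.

```python
def edit_distance(s, t):
    """Edit distance with insert, delete, substitute and transposition."""
    def d(a, b):
        return 0 if a == b else 1  # unit distance between characters

    n, m = len(s), len(t)
    edit = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        edit[i][0] = i  # delete the first i characters of s
    for j in range(m + 1):
        edit[0][j] = j  # insert the first j characters of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(
                edit[i - 1][j] + 1,                           # deletion
                edit[i][j - 1] + 1,                           # insertion
                edit[i - 1][j - 1] + d(s[i - 1], t[j - 1]),   # substitution
            )
            if i > 1 and j > 1:                               # transposition
                best = min(
                    best,
                    edit[i - 2][j - 2]
                    + d(s[i - 1], t[j - 2])
                    + d(s[i - 2], t[j - 1])
                    + 1,
                )
            edit[i][j] = best
    return edit[n][m]

print(edit_distance("cordis", "codis"), edit_distance("cordis", "codris"))
```

The table costs O(#s · #t) time and space; only the last two rows are needed if just the final distance matters.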

Page 36: Information Retrieval ( IR )

Matching Algorithm
- Compute D(A,B)
  - #A, #B: lengths of A, B
  - 0: null symbol
  - R: cost of an edit operation

    D(0,0) = 0
    for i = 1 to #A: D(i,0) = D(i-1,0) + R(A[i] -> 0);
    for j = 1 to #B: D(0,j) = D(0,j-1) + R(0 -> B[j]);
    for i = 1 to #A
      for j = 1 to #B {
        1. m1 = D(i,j-1) + R(0 -> B[j]);        (insert B[j])
        2. m2 = D(i-1,j) + R(A[i] -> 0);        (delete A[i])
        3. m3 = D(i-1,j-1) + R(A[i] -> B[j]);   (substitute A[i] by B[j])
        4. D(i,j) = min{m1, m2, m3};
      }

Page 37: Information Retrieval ( IR )

Classical IR Models
- Boolean model
  - simple, based on set theory
  - queries as Boolean expressions
- Vector space model
  - queries and documents as vectors in term space
- Probabilistic model
  - a probabilistic approach

Page 38: Information Retrieval ( IR )

Text Indexing
- A document is represented by a set of index terms that summarize document contents
  - adjectives, adverbs, connectives are less useful
  - mainly nouns (lexicon look-up)
  - requires text pre-processing (off-line)

Page 39: Information Retrieval ( IR )

Indexing

[Figure: an index over a data repository]

Page 40: Information Retrieval ( IR )

Text Preprocessing
- Extract index terms from text
- Processing stages
  - word separation, sentence splitting
  - change terms to a standard form (e.g., lowercase)
  - eliminate stop-words (e.g., and, is, the, …)
  - reduce terms to their base form (e.g., eliminate prefixes, suffixes)
  - construct mapping between terms and documents (indexing)
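
The first four stages can be sketched as a small pipeline. Everything specific here is an assumption for illustration: the stop-word list is a tiny sample, and `stem` is a naive suffix stripper standing in for a real stemming algorithm.

```python
import re

STOP_WORDS = {"and", "is", "the", "a", "of", "to", "in"}  # sample list only
SUFFIXES = ("ing", "ed", "es", "s")                        # toy stemmer rules

def stem(term):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[:-len(suffix)]
    return term

def index_terms(text):
    """Word separation, lowercasing, stop-word removal, base-form reduction."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

print(index_terms("The system retrieves indexed documents and images."))
```

The last stage, mapping terms to documents, is the job of the index structure itself.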

Page 41: Information Retrieval ( IR )

Text Preprocessing Chart

[Figure: text preprocessing chart, from Baeza-Yates & Ribeiro-Neto, 1999]

Page 42: Information Retrieval ( IR )

Text Indexing Methods
- Main indexing methods:
  - inverted files
  - signature files
  - bitmaps
- Size of index
  - may exceed size of actual collection
  - compress index to reduce storage
  - typically 40% of size of collection

Page 43: Information Retrieval ( IR )

Inverted Files
- Dictionary: terms stored alphabetically
- Posting list: pointers to documents containing the term
- Posting info: document id, frequency of occurrence, location in document, etc.
  - the dictionary is indexed by B-trees or tries, or searched with binary search
- Pros: fast and easy to implement
- Cons: large space overhead (up to 300%)
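
A minimal in-memory sketch of an inverted file. It assumes each document has already been reduced to a list of index terms, and it stores (document id, frequency) pairs as posting info; locations within the document are left out, and the function name is hypothetical.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """docs: {doc_id: [index terms]} -> {term: [(doc_id, frequency), ...]}.

    Terms are inserted in alphabetical order, postings in doc id order.
    """
    postings = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    return {term: sorted(postings[term].items()) for term in sorted(postings)}

docs = {1: ["ocean", "morning", "ocean"], 2: ["morning"]}
print(build_inverted_file(docs))
```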

Page 44: Information Retrieval ( IR )

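Inverted Index

[Figure: an alphabetically sorted index of Greek terms (άγαλμα “statue”, αγάπη “love”, …, δουλειά “work”, …, πρωί “morning”, …, ωκεανός “ocean”) whose entries point to posting lists such as (1,2)(3,4), (4,3)(7,5) and (10,3), which in turn point to documents 1, 2, …, 11, …]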

Page 45: Information Retrieval ( IR )

Query Processing
1. Parse query and extract query terms
2. Dictionary lookup
   - binary search
3. Get postings from inverted file
   - one term at a time
4. Accumulate postings
   - record matching document ids, frequencies, etc.
   - compute weights
5. Rank answers by weights (e.g., frequencies)
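
The five steps can be sketched as follows. The layout is an assumption made for the example: the dictionary is a sorted list of terms, `postings[i]` holds the (document id, frequency) pairs of `dictionary[i]`, and the weight of a document is simply the sum of the frequencies of the matching query terms.

```python
from bisect import bisect_left
from collections import defaultdict

def process_query(query, dictionary, postings):
    """Return [(doc_id, weight), ...] ranked by decreasing weight."""
    scores = defaultdict(int)
    for term in query.lower().split():            # 1. parse the query
        pos = bisect_left(dictionary, term)       # 2. binary search lookup
        if pos == len(dictionary) or dictionary[pos] != term:
            continue                              # term not in dictionary
        for doc_id, freq in postings[pos]:        # 3. postings, term by term
            scores[doc_id] += freq                # 4. accumulate weights
    # 5. rank by weight; ties are broken by document id
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

dictionary = ["morning", "ocean", "work"]
postings = [[(1, 1), (2, 1)], [(1, 2)], [(3, 4)]]
print(process_query("ocean morning", dictionary, postings))
```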

Page 46: Information Retrieval ( IR )

Signature Files
- A filter for eliminating most of the non-qualifying documents
  - signature: short hash-coded representation of a document or query
  - the signatures are stored and searched sequentially
  - the search returns all qualifying documents plus some false alarms
  - the answer is searched to eliminate false alarms

Page 47: Information Retrieval ( IR )

Superimposed Coding
- Each word yields a bit pattern of size F with m bits set to 1 and the rest left as 0
  - the size of F affects the false drop rate
  - bit patterns are OR-ed to form the document signature
  - find documents having 1 in the same bits as the query

    word                 signature
    data                 001 000 110 010
    base                 000 010 101 001
    document signature   001 010 111 011
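
The example on this slide can be replayed with integers as bit patterns. The two word signatures are taken from the slide; in a real system they would come from a hash function, which is not shown here. The function names and the extra query pattern are hypothetical.

```python
# Word signatures from the slide (F = 12 bits, m = 4 bits set per word).
WORD_SIGNATURES = {
    "data": 0b001_000_110_010,
    "base": 0b000_010_101_001,
}

def document_signature(words, word_signatures):
    """OR the bit patterns of all words in the document."""
    signature = 0
    for word in words:
        signature |= word_signatures[word]
    return signature

def may_contain(doc_signature, query_signature):
    """True if the document has 1 in every bit set in the query signature."""
    return (doc_signature & query_signature) == query_signature

doc = document_signature(["data", "base"], WORD_SIGNATURES)
print(format(doc, "012b"))                     # 001010111011
print(may_contain(doc, 0b001_010_000_000))     # hypothetical absent word
```

The last call returns True although no word with that pattern is in the document: a false drop, which is why the answer has to be checked afterwards.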

Page 48: Information Retrieval ( IR )

E.G.M Petrakis Multimedia Information Retrieval

48

BitmapsFor every term, a bit vector is stored

each bit corresponds to a documentset to 1 if term appears in document, 0

otherwisee.g., “word” 10100: the “word” appears in

documents 1 and 3 in a collection of 5 documents

Very efficient for Boolean queriesretrieve bit-vectors for each query term combine bit vectors with Boolean operators
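
Boolean queries over bitmaps reduce to bitwise operators. The sketch below stores each bit vector as an integer whose leftmost bit stands for document 1, as in the “word” example; the second term and the helper name are made up for illustration.

```python
N_DOCS = 5
ALL_DOCS = (1 << N_DOCS) - 1  # mask 11111, needed for NOT

# Term -> bit vector; "word" is the slide's example, "text" is hypothetical.
bitmaps = {"word": 0b10100, "text": 0b00110}

def docs_of(vector, n_docs=N_DOCS):
    """Document numbers (1-based, leftmost bit first) whose bit is set."""
    return [i + 1 for i in range(n_docs) if (vector >> (n_docs - 1 - i)) & 1]

print(docs_of(bitmaps["word"]))                     # word
print(docs_of(bitmaps["word"] & bitmaps["text"]))   # word AND text
print(docs_of(bitmaps["word"] | bitmaps["text"]))   # word OR text
print(docs_of(ALL_DOCS & ~bitmaps["word"]))         # NOT word
```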

Page 49: Information Retrieval ( IR )

Comparison of Text Indexing Methods
- Bitmaps:
  - pros: intuitive, easy to implement and to process
  - cons: space overhead
- Signature files:
  - pros: easy to implement, compact index, no lexicon
  - cons: many unnecessary accesses to documents
- Inverted files:
  - pros: used by most search engines
  - cons: large space overhead (the index can be compressed)

Page 50: Information Retrieval ( IR )

References
- C. Faloutsos, “Searching Multimedia Databases by Content”, Kluwer Academic Publishers, 1996
- R. Baeza-Yates, B. Ribeiro-Neto, “Modern Information Retrieval”, Addison Wesley, 1999
- G. Salton, “Automatic Text Processing”, Addison Wesley, 1989
- Information Retrieval Links: http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.html