IR Traditional Model (transcript)
© Tefko Saracevic 1
Information retrieval (IR):
traditional model
1. Why? Rationale for the module. Definition of IR
2. System & user components
3. Exact match & best match searches
4. Strengths & weaknesses
1. Why? Rationale for the module.
Definition of IR
includes problems addressed in IR
Why?
• Every online database, every search engine, everything that is searched online is based in some way or another on principles developed in IR
– IR is at the heart of searching used in systems such as DIALOG, LexisNexis & others
• Understanding the basics of IR is a prerequisite for understanding how searching of online systems works.
You are asking:
• What basic elements and processes are involved in IR?
• What are the conceptual bases for searching?
• How are these applied in practice?
IR: original definition
“Information retrieval embraces the intellectual aspects of the description of information and its specification for search, and also whatever systems, techniques, or machines are employed to carry out the operation.”
Calvin Mooers, 1951
IR: objective & problems
Provide the users with effective access to & interaction with information resources.
Problems addressed:
1. How to organize information intellectually?
2. How to specify search & interaction intellectually?
3. What systems & techniques to use for those processes?
Where do you fit? With what problems do you deal?
2. System & user components
Traditional IR model presented
IR models
• Model depicts, represents what is involved
– a choice of features, processes, things for consideration
• Several IR models used over time
– traditional: oldest, most used, shows basic elements involved; treated in this module
– interactive: more realistic, favored now, also shows interactions involved; treated in the next module (module 5)
– each has strengths & weaknesses
Description of traditional IR model
• It has two streams of activities
– one is the system side, with processes performed by the system
– the other is the user side, with processes performed by users & intermediaries (you)
– these two sides led to “system orientation” & “user orientation”
– on the system side automatic processing is done; on the user side human processing is done
• They meet at the matching process
– where the query is fed into the system and the system looks for documents that match the query
• Feedback is also involved, so that things change based on results
– e.g. the query is modified & new matching is done
Traditional IR model
[Diagram: the traditional IR model]
System side: Acquisition (documents, objects) → Representation (indexing, ...) → File organization (indexed documents)
User side: Problem (information need) → Representation (question) → Query (search formulation)
The two sides meet at Matching (searching) → Retrieved objects, with feedback looping back
Acquisition (system)
• Content: what is in files, resources
– in DIALOG, the first part of the blue sheets: File Description, Subject Coverage
• Selection of documents & other objects from various sources
– in blue sheets: Sources
• Mostly text-based documents
– full texts, titles, abstracts ...
– but also other objects: data, statistics, images, maps, trademarks, sounds ...
Importance: determines contents – what is in it. Key to file, resource selection!
Representation of documents, objects (system)
• Indexing – many ways:
– free-text terms (even in full texts)
– controlled vocabulary – thesaurus
– manual & automatic techniques
• Abstracting; summarizing
• Bibliographic description:
– author, title, sources, date ...
– metadata
• Classifying, clustering
• Organizing in fields & limits
– in DIALOG: Basic Index, Additional Index, Limits
Basic to what is available for searching & displaying
File organization (system)
• Sequential
– record (document) by record
• Inverted
– term by term; a list of records under each term
• Combination: indexes inverted, documents sequential
• When only citations are retrieved, there is a need for document files
• Large-file approaches
– for efficient retrieval by computers
Enables searching & interplay between types of files
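The inverted organization above can be sketched in a few lines of Python. This is a minimal illustration, not any particular system's implementation; the record numbers and texts are invented:

```python
from collections import defaultdict

# Hypothetical toy collection: record number -> text
documents = {
    1: "digital libraries and information retrieval",
    2: "history of library classification",
    3: "digital preservation in libraries",
}

# Build an inverted file: term -> list of record numbers containing it
inverted_index = defaultdict(list)
for rec_no, text in sorted(documents.items()):
    for term in set(text.split()):      # each term counted once per record
        inverted_index[term].append(rec_no)

# A term lookup now touches one short list instead of scanning every record
print(sorted(inverted_index["digital"]))   # -> [1, 3]
```

The lookup cost is what motivates the inverted form: matching a query term means fetching one posting list, while the sequential organization would require reading every record.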
Problem (user)
• Related to the user’s task, situation
– varies in specificity, clarity
• Produces information need
– the ultimate criterion for effectiveness of retrieval: how well was the need met?
• The information need for the same problem may change, evolve, shift during the IR process – adjustment in searching
– often more than one search for the same problem over time; you will experience this in your term project
Critical for examination in the interview
Representation – question (user & possibly system)
• Non-mediated: end user alone
• Mediated: intermediary + user
– interviews; human-human interaction
• Question analysis
– selection, elaboration of terms
– various tools may be used: thesaurus, classification schemes, dictionaries, textbooks, catalogs ...
• Focus toward
– deriving search terms & logic
– selection of files, resources
• Subject to feedback changes
• Critical roles of intermediary – you
Determines search specification – a dynamic process
Query – search statement (user & system)
• Translation into system requirements & limits
– start of human-computer interaction: the query is the thing that goes into the computer
• Selection of files, resources
• Search strategy – selection of:
– search terms & logic
– possible fields, delimiters
– controlled & uncontrolled vocabulary
– variations in effectiveness tactics
• Reiterations from feedback
– several feedback types: relevance feedback, magnitude feedback ...
– query expansion & modification
What & how of actual searching
Clarifying the difference
• Question is what the user asks and what you may then have elaborated
• Query is what is asked of the computer to match – what is put in
• The question is transformed into the query
• Question:
– I am interested in major historical developments in the area of information retrieval.
• Query:
– history information retrieval (in Google)
– history AND information(w)retrieval (in DIALOG) (plus you have to select which file(s) to search)
Matching – searching (user & system)
• Process of matching, comparing
– search: what documents in the file match the query as stated?
• Various search algorithms:
– exact match – Boolean: still available in most, if not all, systems
– best match – ranking by relevance: increasingly used, e.g. on the web
– hybrids incorporating both: e.g. Target, Rank in DIALOG
• Each has strengths & weaknesses
– no ‘perfect’ method exists, and probably never will
Involves many types of search interactions & formulations
Retrieved documents (from system to user)
• Various orders of output:
– Last In First Out (LIFO); sorted
– ranked by relevance
– ranked by other characteristics
• Various forms of output
– in DIALOG: Output options
• When citations only: possible links to document delivery
• Basis for relevance, utility evaluation by users
• Relevance feedback
What a user (or you) sees, gets, judges – can be specified
3. Exact match & best match searches
Getting to that Boolean and similar stuff – the nitty-gritty of matching, which actually affects how you formulate the query
Exact match - Boolean search
• You retrieve exactly what you ask for in the query:
– all documents that have the term(s), with logical connection(s) and possibly other restrictions (e.g. to be in titles), as stated in the query
– exactly: nothing less, nothing more
• Based on matching following the rules of Boolean algebra, or algebra of sets
– ‘new algebra’
– presented by circles in Venn diagrams
Boolean algebra
• Operates on sets
– e.g. sets of documents
• Has four operations (as in algebra):
1. A: retrieve set A
– I want documents that have the term library
2. A AND B: retrieve the set that has both A and B; often called intersection & labeled A ∩ B
– I want documents that have both terms library and digital someplace within
3. A OR B: retrieve the set that has either A or B; often called union & labeled A ∪ B
– I want documents that have either the term library or the term digital someplace within
4. A NOT B: retrieve set A but not B; often called negation & labeled A – B
– I want documents that have the term library, but if they also have the term digital I do not want those
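The four operations map directly onto set operations in most programming languages. A sketch in Python, with invented posting sets standing in for real files:

```python
# Hypothetical posting sets: record numbers of documents containing each term
library = {1, 2, 4, 7}
digital = {2, 3, 7, 9}

print(library)             # A alone               -> {1, 2, 4, 7}
print(library & digital)   # A AND B (intersection) -> {2, 7}
print(library | digital)   # A OR B (union)         -> {1, 2, 3, 4, 7, 9}
print(library - digital)   # A NOT B (negation)     -> {1, 4}
```

Note how AND shrinks the output while OR grows it, which is why AND is used to narrow a search and OR to broaden it.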
Potential problems
• But beware:
– digital AND library will retrieve documents that have digital library (together as a phrase), but also documents that have digital in the first paragraph and library in the third section, 5 pages later, and that do not deal with digital libraries at all
– thus in Google you will ask for “digital library” and in DIALOG for digital(w)library to retrieve the exact phrase digital library
– digital NOT library will retrieve documents that have digital and suppress those that, along with digital, also have library; but sometimes those suppressed documents may very well be relevant. Thus, NOT is also known as the “dangerous operator”
Boolean algebra depicted in Venn diagrams
[Diagram: two overlapping circles A & B; region 1 = A only, region 2 = the overlap, region 3 = B only]
Four basic operations, e.g. A = digital, B = libraries:
• A alone. All documents that have A. Shade 1 & 2. – digital
• A AND B. Shade 2. – digital AND libraries
• A OR B. Shade 1, 2, 3. – digital OR libraries
• A NOT B. Shade 1. – digital NOT libraries
Venn diagrams … cont.
Complex statements allowed, e.g. with three overlapping circles A, B, C
[Diagram: three-circle Venn diagram, regions numbered 1-7; regions 4, 5, 6 are where C overlaps A &/or B]
• (A OR B) AND C. Shade 4, 5, 6.
– (digital OR libraries) AND Rutgers
• (A OR B) NOT C. Shade what?
– (digital OR libraries) NOT Rutgers
Venn diagrams cont.
• Complex statements can be made
– as in ordinary algebra, e.g. (2+3)×4
• As in ordinary algebra, watch the parentheses:
– 2+(3×4) is not the same as (2+3)×4
– (A AND B) OR C is not the same as A AND (B OR C)
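The parenthesis warning can be checked directly with toy sets (the numbers are invented):

```python
A = {1, 2}
B = {2, 3}
C = {3, 4}

# Moving the parentheses changes the answer, exactly as in ordinary algebra
print((A & B) | C)   # (A AND B) OR C -> {2, 3, 4}
print(A & (B | C))   # A AND (B OR C) -> {2}
```

This is why most operational systems either fix an operator precedence or require explicit parentheses in complex queries.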
Best match searching
• Output is ranked
– it is NOT presented as a Boolean set but in some rank order
• You retrieve documents ranked by how similar (close) they are to a query (as calculated by the system)
– similarity is assumed to indicate relevance
– ranked from highest to lowest relevance to the query; mind you, as considered by the system – you change the query, the system changes the ranking
– thus, documents as answers are presented from those most likely relevant downwards to those less & less likely relevant
– can be cut at any desired number – e.g. first 10
Best match ... cont.
• The best match process deals with PROBABILITY:
– compares the set of query terms with the sets of terms in documents
– calculates a similarity between the query & each document based on common terms &/or other aspects
– sorts the documents in order of similarity
– assumes that higher-ranked documents have a higher probability of being relevant
– allows for a cut-off at a chosen number
• BIG issue: which representations & similarity measures are better?
– “better” is determined by a number of criteria, e.g. relevance, speed ...
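The steps above can be sketched with a deliberately crude similarity measure: the fraction of a document's terms that occur in the query. Real systems use far more refined formulas (tf-idf, cosine similarity, link analysis); the collection here is made up:

```python
def similarity(query_terms, doc_text):
    """Fraction of the document's words that appear in the query.
    A stand-in for real measures such as tf-idf or cosine similarity."""
    words = doc_text.lower().split()
    common = sum(1 for w in words if w in query_terms)
    return common / len(words)

# Hypothetical collection
docs = {
    "d1": "digital libraries overview",
    "d2": "history of digital computing and digital art",
    "d3": "library management",
}
query = {"digital", "libraries"}

# Sort from most to least similar; cut off at any desired number, e.g. top 10
ranked = sorted(docs, key=lambda d: similarity(query, docs[d]), reverse=True)
print(ranked)   # -> ['d1', 'd2', 'd3']
```

Note that d3 scores zero because "library" does not literally match "libraries"; handling such variants (stemming, truncation) is part of the representation problem discussed earlier.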
Best match (cont.)
• A variety of algorithms (formulas) is used to determine similarity
– using statistical &/or linguistic properties
e.g. if digital appears a lot in a given document relative to its size, that document will be ranked higher when the query is digital
– many proposed & tested in IR research
– many developed by commercial organizations
Google also uses calculations based on the number of links to/from a document; many algorithms are now proprietary
– the system’s ranking and your ranking may not necessarily be in agreement
• Web outputs are mostly ranked
• But DIALOG allows ranking as well, with special commands
4. Strengths & weaknesses
Boolean vs. best match
• Boolean
– allows for logic
– provides all that has been matched
BUT
– has no particular order of output
– treats all retrievals equally, from the most to the least relevant ones
– often requires examination of large outputs
• Best match
– allows for free terminology
– provides a ranked output
– provides for cut-off – any size output
BUT
– does not include logic
– ranking method (algorithm) not transparent: whose relevance?
– where to cut off?
Strengths of traditional IR model
• Lists major components in both system & user branches
• Suggests:
– what to explain to users about the system, if needed
– what to ask of users for more effective searching (problem ...)
• Selection of component(s) for concentration
– mostly ever-better representation
• Provides a framework for evaluation of (static) aspects
Weaknesses
• Does not address nor account for interaction & judgment of results by users– identifies interaction with search only– interaction is a much richer process
• Many types of & variables in interaction not reflected
• Feedback has many types & functions - also not shown
• Evaluation thus one-sided
IR is a highly interactive process- thus additional model(s) needed
Interactive models
• Explored in next module
Module 5