COMP3410 DB32: Technologies for Knowledge Management
Lecture 7:
Query Broadening to improve IR
By Eric Atwell, School of Computing, University of Leeds
(including re-use of teaching resources from other sources, esp. Stuart Roberts, School of Computing, Univ of Leeds)
Module Objectives
“On completion of this module, students should be able to:
… describe classical and emerging information retrieval techniques, and their relevance to knowledge management; …”
Today’s objectives
• first we look at a method for query broadening that requires input from the user
• then we look at an automatic method for query broadening using a thesaurus
• by the end of the lecture you should understand what a thesaurus, a terminology bank and an ontology are, and how they are used to broaden queries
Some issues to be resolved
• Synonyms
– football / soccer, tap / faucet: search for one, find both?
• Homonyms
– lead (metal or leash?), tap: find both, only want one?
• Local/global contexts determine “good” terms
– football articles won’t mention the word ‘football’, and will have a particular meaning for the word ‘goal’
• Precoordination (proximity query): multi-word terms
– “Venetian blind” vs “blind Venetian”
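The precoordination issue can be illustrated with a minimal sketch, assuming naive lower-cased whitespace tokenisation. The function names and the two sample sentences are hypothetical, not taken from the lecture. An ordered phrase match separates “Venetian blind” from “blind Venetian”, while a plain bag-of-words match cannot.

```python
def contains_phrase(text: str, phrase: str) -> bool:
    """True if the words of `phrase` occur adjacently and in order in `text`."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    if n == 0:
        return False
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def contains_all_terms(text: str, phrase: str) -> bool:
    """Bag-of-words match: every query word occurs somewhere, order ignored."""
    words = set(text.lower().split())
    return all(w in words for w in phrase.lower().split())


# Made-up documents for illustration only.
doc_a = "she bought a venetian blind for the window"
doc_b = "the blind venetian played the violin"

print(contains_phrase(doc_a, "Venetian blind"))     # phrase match on doc_a
print(contains_phrase(doc_b, "Venetian blind"))     # phrase match on doc_b
print(contains_all_terms(doc_b, "Venetian blind"))  # bag-of-words match on doc_b
```

Only the phrase match rejects doc_b; the bag-of-words match accepts both documents.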
Evaluation/Effectiveness measures
• effort - required by the users in formulation of queries
• time - between receipt of user query and production of list of ‘hits’
• presentation - of the output
• coverage - of the collection
• recall - the fraction of the relevant items in the collection that are retrieved
• precision - the fraction of retrieved items that are relevant
• user satisfaction - with the retrieved items
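Recall and precision as defined above can be computed directly from two sets of document identifiers. This is a minimal sketch; the function name and the identifiers are made up for illustration.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    An empty denominator is reported as 0.0 (an assumption of this sketch).
    """
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical outcome: two documents shown, one of them relevant,
# and one relevant document missed.
retrieved = {"doc_a", "doc_d"}
relevant = {"doc_a", "doc_c"}
p, r = precision_recall(retrieved, relevant)
print(p, r)
```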
Better hits: Query Broadening
• A user unaware of collection characteristics is likely to formulate a ‘naïve’ query
• query broadening aims to replace the initial query with a new one featuring one or other of:
– new index terms
– adjusted term weights
• One method uses feedback information from the user
• Another method uses a thesaurus / term-bank / ontology
Relevance Feedback
From the response to the initial query (the set of hits H), gather relevance information (the set of relevant documents R):
$H_R = R \cap H$ = set of retrieved, relevant hits
$H_{NR} = H - R$ = set of retrieved, non-relevant hits
Replace query q with replacement query q', where α, β and γ are weights:
$$q' = \alpha q + \beta \sum_{d_i \in H_R} \frac{d_i}{|H_R|} - \gamma \sum_{d_i \in H_{NR}} \frac{d_i}{|H_{NR}|}$$
Note: this moves the query vector closer to the centroid of the “relevant retrieved” document vectors and further from the centroid of the “non-relevant retrieved” documents.
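The replacement-query update (commonly known as the Rocchio formula) adds a weighted centroid of the relevant hits to the weighted query and subtracts a weighted centroid of the non-relevant hits. A minimal sketch in plain Python follows. The function names, the default weights, and the treatment of an empty set as a zero vector are assumptions of this sketch, not part of the lecture.

```python
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]], dim: int) -> List[float]:
    """Mean of the given vectors; a zero vector if the set is empty (assumption)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def feedback_query(q, relevant, non_relevant, alpha=1.0, beta=0.5, gamma=0.25):
    """q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non_relevant).

    The default weights are arbitrary illustrative values.
    """
    dim = len(q)
    rel_c = centroid(relevant, dim)
    non_c = centroid(non_relevant, dim)
    return [alpha * q[k] + beta * rel_c[k] - gamma * non_c[k] for k in range(dim)]


# Toy two-term example with made-up vectors.
q_new = feedback_query(
    [1.0, 0.0],
    relevant=[[0.0, 1.0], [0.0, 0.5]],
    non_relevant=[[1.0, 0.0]],
    alpha=1.0, beta=1.0, gamma=0.5,
)
print(q_new)  # [0.5, 0.75]
```

In the toy run the second term, used only by the relevant hits, gains weight, and the first term, used by the non-relevant hit, loses weight.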
Using terms from relevant documents
• We expect documents that are similar to one another in meaning (or usefulness) to have similar index terms.
• The system creates a replacement query (q') based on q, but adds index terms that have been used to index known relevant documents, increases the relative weight of index terms in q that are also found in relevant documents, and reduces the weight of terms found in non-relevant documents.
How does this help?
• It could help if documents were being missed because of the synonym problem. The user uses the word ‘jam’, but some recipes use ‘jelly’ instead. Once a hit that uses ‘jelly’ has been recognized as relevant, ‘jelly’ will appear in the next version of the query, so new hits may use ‘jelly’ but not ‘jam’.
• Conversely, it can help with the homonym problem. If the user wants references to ‘lead’ (the metal) and gets documents relating to dog-walking, then marking the dog-walking references as not relevant reduces the weight of key words associated with dog-walking.
Pros and cons of feedback
• If γ (the weight on non-relevant hits) is set to 0, non-relevant hits are ignored, giving a positive feedback system; this is often preferred
• the feedback formula can be applied repeatedly, asking the user for relevance information at each iteration
• relevance feedback is generally considered to be very effective for “high-use” systems
• one drawback is that it is not fully automatic.
Simple feedback example:
T = {pudding, jam, traffic, lane, treacle}
d1 = (0.8, 0.8, 0.0, 0.0, 0.4): Recipe for jam pudding
d2 = (0.0, 0.0, 0.9, 0.8, 0.0): DoT report on traffic lanes
d3 = (0.8, 0.0, 0.0, 0.0, 0.8): Recipe for treacle pudding
d4 = (0.6, 0.9, 0.5, 0.6, 0.0): Radio item on traffic jam in Pudding Lane
Display first 2 documents that match the following query:
q = (1.0, 0.6, 0.0, 0.0, 0.0)
r = (0.91, 0.0, 0.6, 0.73)
Retrieved documents are:
d1 : Recipe for jam pudding (relevant)
d4 : Radio item on traffic jam (not relevant)
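The scores in r agree, to within rounding, with the cosine similarity between q and each document vector. The sketch below assumes cosine is the matching function (the slide does not name it) and ranks the four documents for the initial query.

```python
import math

# Term order: pudding, jam, traffic, lane, treacle (from the example above).
docs = {
    "d1": [0.8, 0.8, 0.0, 0.0, 0.4],  # Recipe for jam pudding
    "d2": [0.0, 0.0, 0.9, 0.8, 0.0],  # DoT report on traffic lanes
    "d3": [0.8, 0.0, 0.0, 0.0, 0.8],  # Recipe for treacle pudding
    "d4": [0.6, 0.9, 0.5, 0.6, 0.0],  # Radio item on traffic jam in Pudding Lane
}
q = [1.0, 0.6, 0.0, 0.0, 0.0]


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


scores = {name: cosine(q, vec) for name, vec in docs.items()}
top2 = sorted(scores, key=scores.get, reverse=True)[:2]
for name in docs:
    print(name, round(scores[name], 2))
print(top2)  # ['d1', 'd4']
```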
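Positive and Negative Feedback
Suppose we set α and β to 0.5, and γ to 0.2 (note that $|H_R| = 1$ and $|H_{NR}| = 1$):
$$
\begin{aligned}
q' &= \alpha q + \beta \sum_{d_i \in H_R} \frac{d_i}{|H_R|} - \gamma \sum_{d_i \in H_{NR}} \frac{d_i}{|H_{NR}|} \\
   &= 0.5\,q + 0.5\,d_1 - 0.2\,d_4 \\
   &= 0.5\,(1.0, 0.6, 0.0, 0.0, 0.0) + 0.5\,(0.8, 0.8, 0.0, 0.0, 0.4) - 0.2\,(0.6, 0.9, 0.5, 0.6, 0.0) \\
   &= (0.78, 0.52, -0.1, -0.12, 0.2)
\end{aligned}
$$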
Simple feedback example:
T = {pudding, jam, traffic, lane, treacle}
d1 = (0.8, 0.8, 0.0, 0.0, 0.4),
d2 = (0.0, 0.0, 0.9, 0.8, 0.0),
d3 = (0.8, 0.0, 0.0, 0.0, 0.8)
d4 = (0.6, 0.9, 0.5, 0.6, 0.0)
Display first 2 documents that match the following query:
q' = (0.78, 0.52, -0.1, -0.12, 0.2)
r' = (0.96, -0.16, 0.71, 0.63)
Retrieved documents are:
d1 : Recipe for jam pudding (relevant)
d3 : Recipe for treacle pudding (relevant)
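The whole feedback round can be replayed in a few lines. This sketch assumes cosine similarity as the matching function and uses the weights 0.5, 0.5 and 0.2 from the example, with d1 marked relevant and d4 marked not relevant. The variable and function names are illustrative only.

```python
import math

# Term order: pudding, jam, traffic, lane, treacle.
docs = {
    "d1": [0.8, 0.8, 0.0, 0.0, 0.4],
    "d2": [0.0, 0.0, 0.9, 0.8, 0.0],
    "d3": [0.8, 0.0, 0.0, 0.0, 0.8],
    "d4": [0.6, 0.9, 0.5, 0.6, 0.0],
}
q = [1.0, 0.6, 0.0, 0.0, 0.0]
alpha, beta, gamma = 0.5, 0.5, 0.2
relevant_hits = ["d1"]      # user feedback on the first result list
non_relevant_hits = ["d4"]


def mean_vector(names):
    """Centroid of the named document vectors (names must be non-empty)."""
    vectors = [docs[n] for n in names]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


rel_c = mean_vector(relevant_hits)
non_c = mean_vector(non_relevant_hits)
q_new = [alpha * qi + beta * r - gamma * n for qi, r, n in zip(q, rel_c, non_c)]

scores = {name: cosine(q_new, vec) for name, vec in docs.items()}
top2 = sorted(scores, key=scores.get, reverse=True)[:2]
print([round(x, 2) for x in q_new])
print(top2)  # ['d1', 'd3']
```

The treacle pudding recipe (d3) now outranks the traffic report (d4), because ‘treacle’ entered the query from d1 while ‘traffic’ and ‘lane’ received negative weights.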
Thesaurus
• a thesaurus or ontology may contain:
– a controlled vocabulary of terms or phrases describing a specific restricted topic
– synonym classes
– a hierarchy defining broader terms (hypernyms) and narrower terms (hyponyms)
– classes of ‘related’ terms
• a thesaurus or ontology may be:
– generic (as Roget’s thesaurus, or WordNet)
– specific to a certain domain of knowledge, e.g. medical
Language normalisation
By replacing words from documents and query words with synonyms from a controlled language, we can improve precision and recall.
[Diagram: content analysis produces uncontrolled keywords from the documents; the thesaurus maps these to index terms and maps the user query to a normalised query; index terms and normalised query are then matched.]
Thesaurus / Ontology construction
• Include terms likely to be of value in content analysis
• for each term, form classes of related words (separate classes for synonyms, hypernyms, hyponyms)
• form separate classes for each relevant meaning of the word
• terms in a class should occur with roughly equal frequency (not easy: natural-language word frequencies follow Zipf’s law)
• avoid high-frequency terms
• construction involves some expert judgment that will not be easy to automate.
Example thesaurus
A public-domain thesaurus (WORDNET) is available from:
http://www.cogsci.princeton.edu/~wn/
/home/cserv1_a/staff/nlplib/WordNet/2.0
/home/cserv1_a/staff/extras/nltk/1.4.2/corpora/wordnet
Synonyms (sense 1): computer, data processor, electronic computer, information processing system
Synonyms (sense 2): computer, calculator, reckoner, figurer, estimator
Terminology (from WordNet Help)
Hypernym is the generic term used to designate a whole class of specific instances. Y is a hypernym of X if X is a (kind of) Y.
Hyponym is the generic term used to designate a member of a class. X is a hyponym of Y if X is a (kind of) Y.
Coordinate words are words that have the same hypernym.
Hypernym synsets are preceded by "->", and hyponym synsets are preceded by "=>".
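These three relations can be sketched with a single "direct hypernym" table. The dictionary below is a tiny hand-made fragment built from terms shown on the WordNet slides; it is not the WordNet database or its API, and the function names are hypothetical.

```python
# Hand-made fragment: term -> its direct hypernym (broader term).
hypernym_of = {
    "computer": "machine",
    "calculator": "machine",
    "cash machine": "machine",
    "digital computer": "computer",
    "server": "computer",
    "machine": "device",
}


def hypernym_chain(term):
    """Follow 'X is a kind of Y' links upwards from `term`."""
    chain = []
    while term in hypernym_of:
        term = hypernym_of[term]
        chain.append(term)
    return chain


def hyponyms(term):
    """Direct hyponyms: every X whose hypernym is `term`."""
    return sorted(t for t, parent in hypernym_of.items() if parent == term)


def coordinate_terms(term):
    """Other terms that share the same direct hypernym as `term`."""
    parent = hypernym_of.get(term)
    if parent is None:
        return []
    return [t for t in hyponyms(parent) if t != term]


print(hypernym_chain("server"))      # ['computer', 'machine', 'device']
print(hyponyms("computer"))          # ['digital computer', 'server']
print(coordinate_terms("computer"))  # ['calculator', 'cash machine']
```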
Hypernyms
Sense 1
computer, data processor, electronic computer, information processing system
-> machine
  -> device
    -> instrumentality, instrumentation
      -> artifact, artefact
        -> object, physical object
          -> entity, something
Hyponyms
Sense 1
computer, data processor, electronic computer, information processing system
=> analog computer, analogue computer
=> digital computer
=> node, client, guest
=> number cruncher
=> pari-mutuel machine, totalizer, totaliser, totalizator, totalisator
=> server, host
Coordinate terms
Sense 1
computer, data processor, electronic computer, information processing system
-> machine
  => assembly
  => calculator, calculating machine
  => calendar
  => cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM
  => computer, data processor, electronic computer, information processing system
  => concrete mixer, cement mixer
  => corker
  => cotton gin, gin
  => decoder
Thesaurus use
• replace term in document and/or query with term in controlled language
• replace term in query with related or broader term to increase recall
• suggest to user narrower terms to increase precision
[Diagram (S): the thesaurus maps both Doc: <data processor> and Query: <electronic computer> to computer (sense 1), so document and query match.]
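The first use, normalising document and query terms to a controlled term, can be sketched as a lookup in a synonym table. The table holds only the sense 1 synonyms of ‘computer’ listed earlier; the function name and the fall-through behaviour for unknown terms are assumptions of this sketch.

```python
# Synonym class for 'computer (sense 1)': each variant maps to the controlled term.
controlled_term = {
    "computer": "computer (sense 1)",
    "data processor": "computer (sense 1)",
    "electronic computer": "computer (sense 1)",
    "information processing system": "computer (sense 1)",
}


def normalise(term: str) -> str:
    """Map a term to its controlled form; unknown terms pass through unchanged."""
    key = " ".join(term.lower().split())  # fold case and extra spaces
    return controlled_term.get(key, key)


doc_term = normalise("data processor")
query_term = normalise("Electronic computer")
print(doc_term, "|", query_term, "|", doc_term == query_term)
```

Without normalisation the strings ‘data processor’ and ‘electronic computer’ share no word, so a plain term match would miss the document.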
[Diagram (B): the thesaurus replaces Query: <node (sense 6)> with the broader Query: <computer (sense 1)>; each query is matched against the whole collection.]
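Broadening a query with a hypernym can be sketched as follows. The index, the document identifiers and the function names are hypothetical. This sketch also adds the broader term to the query rather than replacing the original term, which is a design choice and not something the slide prescribes.

```python
# Hypothetical inverted view: document id -> set of index terms.
index = {
    "doc1": {"node"},
    "doc2": {"computer"},
    "doc3": {"computer", "server"},
    "doc4": {"calendar"},
}

# Fragment of the hierarchy: term -> broader term.
broader_term = {"node": "computer", "server": "computer", "computer": "machine"}


def search(terms):
    """Documents indexed by at least one of the query terms."""
    wanted = set(terms)
    return sorted(doc for doc, doc_terms in index.items() if doc_terms & wanted)


def broaden(term):
    """The original term plus its broader term, if the thesaurus has one."""
    terms = [term]
    if term in broader_term:
        terms.append(broader_term[term])
    return terms


narrow_hits = search(["node"])
broad_hits = search(broaden("node"))
print(narrow_hits)  # ['doc1']
print(broad_hits)   # ['doc1', 'doc2', 'doc3']
```

More documents are retrieved, which can raise recall, but some of the extra hits may be about other kinds of computer, so precision may fall.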
[Diagram (N): for Query: <computer (sense 1)>, the thesaurus suggests the narrower term client to the user; the original query and the narrowed Query: client are each matched against the whole collection.]
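Suggesting narrower terms keeps the user in the loop: the system offers hyponyms and the user picks one. A minimal sketch, with a hypothetical index and function names, and the user's choice hard-coded as ‘client’ to mirror the diagram:

```python
# Narrower terms for 'computer', taken from the hyponym list shown earlier.
narrower_terms = {
    "computer": ["analog computer", "digital computer", "client",
                 "number cruncher", "server"],
}

# Hypothetical index: document id -> set of index terms.
index = {
    "doc1": {"computer", "client"},
    "doc2": {"computer", "server"},
    "doc3": {"computer"},
}


def search(term):
    """Documents indexed by `term`."""
    return sorted(doc for doc, terms in index.items() if term in terms)


def suggest_narrower(term):
    """Narrower terms the system could offer; empty if none are known."""
    return narrower_terms.get(term, [])


all_hits = search("computer")
suggestions = suggest_narrower("computer")
chosen = "client"  # stands in for the user's selection from `suggestions`
focused_hits = search(chosen)
print(all_hits)      # ['doc1', 'doc2', 'doc3']
print(suggestions)
print(focused_hits)  # ['doc1']
```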
Key points
• a thesaurus or ontology can be used to normalise a vocabulary and queries (or documents?)
• it can be used (with some human intervention) to increase recall and precision
• a generic thesaurus/ontology may not be effective in specialized collections and/or queries
• Semi-automatic construction of a thesaurus/ontology based on the retrieved set of documents has produced some promising results.