![Page 1: Special Topics in Computer Science The Art of Information Retrieval Chapter 4: Query Languages Alexander Gelbukh](https://reader036.vdocuments.net/reader036/viewer/2022070305/5514777e550346b2598b45f8/html5/thumbnails/1.jpg)
Special Topics in Computer Science
The Art of Information Retrieval
Chapter 4: Query Languages
Alexander Gelbukh
www.Gelbukh.com

Previous Chapter
Main measures: Precision & Recall
o Defined for sets
o Rankings are evaluated through initial subsets
There are measures that combine them into one
o Involve user-defined preferences; in the F-measure, set to 50-50
Many (other) characteristics
o An algorithm can be good at some and bad at others
o Averages are used, but are not always meaningful
Reference collections with known answers exist for evaluating new algorithms
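
As a reminder of how these measures are computed, here is a minimal Python sketch. The function names are illustrative, and the F-measure is written in its usual weighted-harmonic-mean form, which is assumed to be the one meant here; `beta=1` corresponds to the 50-50 setting.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against the known relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean; beta=1 weights precision and recall equally."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def precision_at(ranking, relevant, k):
    """Evaluate a ranking through its initial subset of size k."""
    return precision_recall(ranking[:k], relevant)[0]
```

The user-defined preference enters only through `beta`: values above 1 favor recall, values below 1 favor precision.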

Previous chapter: research issues
Different types of interfaces; interactive systems:
o What measures to use?
o How do people judge relevance?
o How can “user satisfaction” be measured? Modeled?

Query languages
Query language = the types of queries that can be posed
The types of queries depend on the IR model
Types:
o IR (= ranked output)
o Data retrieval
o User-oriented
o Low-level (= protocols)
Assume all pre-processing has been done
o Thesaurus, stop-words, ...
o (I think this must be a part of the language!)
Returns “documents” (chapter, paragraph, ...)

In this chapter
Keyword-based languages
Pattern matching
Structure taken into account
Protocols

Keyword-based languages: Single word
Intuitive, easy to express, fast ranking
o Words can be highlighted in the output
What is a word?
o Letters, separators
o Non-splitting characters: on-line
o The database decides
TF-IDF is designed for words
Used for the main models (Boolean, Vector, Probabilistic)
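
The question of what a word is can be made concrete with a small tokenizer. This is only a sketch: the function name and the `non_splitting` parameter are illustrative, and a real system would make this choice when the database is built.

```python
import re


def tokenize(text, non_splitting="-"):
    """Split text into lowercase words.

    Characters listed in `non_splitting` do not split a word when they
    stand between letters or digits; everything else is a separator.
    """
    if non_splitting:
        joiner = "[" + re.escape(non_splitting) + "]"
        pattern = rf"[^\W_]+(?:{joiner}[^\W_]+)*"
    else:
        pattern = r"[^\W_]+"
    return [w.lower() for w in re.findall(pattern, text)]
```

With the default setting, "on-line" stays one word; with `non_splitting=""` it becomes the two words "on" and "line", so the same query can match different documents depending on this decision.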

Keyword-based languages: Context Queries
Ensure that the words are related
Phrase
o “enhance retrieval”
o Allows separators and stopwords: “enhance the retrieval”
Proximity
o “enhance the quality of information retrieval”
o Distance: in words or letters. Order: same or not
Not clear how to rank
o Research issue
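
Phrase and proximity queries can both be sketched over word positions. The code below is a rough illustration, not a description of any particular system: the stopword list is made up, distance is counted in words after stopword removal, and only a yes/no answer is produced, since ranking is the open issue.

```python
from collections import defaultdict

STOPWORDS = {"the", "of", "a", "an"}  # illustrative list, not from the slides


def positional_index(text):
    """Map each non-stopword to the list of its positions in the text."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    index = defaultdict(list)
    for pos, word in enumerate(words):
        index[word].append(pos)
    return index


def near(index, w1, w2, max_distance, ordered=True):
    """True if w2 occurs within max_distance words of w1.

    With ordered=True, w2 must come after w1; otherwise either order counts.
    """
    for p1 in index.get(w1, []):
        for p2 in index.get(w2, []):
            gap = p2 - p1
            if ordered and 0 < gap <= max_distance:
                return True
            if not ordered and 0 < abs(gap) <= max_distance:
                return True
    return False


def phrase(index, w1, w2):
    """A two-word phrase query is ordered proximity with distance 1."""
    return near(index, w1, w2, 1, ordered=True)
```

Because stopwords are dropped before positions are assigned, the phrase “enhance retrieval” also matches the text “enhance the retrieval”, while “enhance the quality of information retrieval” needs a proximity query with distance 3.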

Keyword-based languages: Boolean Queries
Boolean expressions (can combine basic queries)
o Basic = keyword, pattern
Query syntax tree
o translation AND (syntax OR syntactic)
Operations on sets
o Result: a set
OR, AND, e1 BUT e2
o NOT is not used: it could return (almost) all docs (= unsafe)
Good: can highlight occurrences, sort
Bad: difficult for users
Remedy (?): fuzzy Boolean (see below)
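
A query syntax tree can be evaluated bottom-up with plain set operations. In this sketch the tree is a nested tuple, the leaves are keywords, and the posting sets in `index` are invented for the example.

```python
def evaluate(node, index):
    """Evaluate a query syntax tree; every node yields a set of document ids."""
    if isinstance(node, str):            # leaf: a basic (keyword) query
        return set(index.get(node, ()))
    op, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if op == "AND":
        return a & b
    if op == "OR":
        return a | b
    if op == "BUT":                      # e1 BUT e2: docs of e1 that are not in e2
        return a - b
    raise ValueError(f"unknown operator: {op}")


# Hypothetical posting sets: keyword -> ids of the documents containing it.
index = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
}
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
result = evaluate(query, index)          # {2, 4}
```

BUT is safe because its result is always a subset of its left operand, whereas a free-standing NOT would need the set of all documents.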

Keyword-based languages: Fuzzy Boolean, Natural Language
Fuzzy Boolean: OR and AND are both relaxed to “some”
o AND penalizes absence, OR rewards multiple matches
o Natural ranking: how many times?
Natural Language: OR = AND
o BUT can be expressed (= penalty)
o How to rank? Different ways
Vector space model
o Query is a vector
o A doc can itself be taken as a (query) vector. Relevance feedback!
Proximity is ignored
o (Why? Research issue.)
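
One of the “different ways” to rank a natural language query is cosine similarity in the vector space model. The sketch below uses raw term counts instead of TF-IDF weights to stay short; that simplification and the function names are assumptions of the example.

```python
import math
from collections import Counter


def cosine(query, doc):
    """Cosine similarity between two texts seen as bags of words."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return document indices sorted by decreasing similarity to the query."""
    return sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
```

Since the query is just a bag of words, a whole document can be passed as `query`, which is the idea behind relevance feedback. Word order and distance never enter the computation, which is why proximity is ignored.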

Pattern matching...
Pattern = sequence of features
o A text segment matches the pattern
Types:
Words
Prefixes, suffixes, substrings
o comput-, -ters, -any flow- (many flowers)
Ranges
o Implies some order, e.g., lexicographical (= alphabetic)
Allowing errors
o Levenshtein (= edit) distance: historical / hysterical
o Number of insertions, deletions, replacements. Threshold.
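
The edit distance can be computed with the classic dynamic-programming recurrence. This is a minimal sketch; `matches_with_errors` is an illustrative helper showing how the threshold is applied.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                 # delete ca
                curr[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),    # replace ca by cb (free if equal)
            ))
        prev = curr
    return prev[-1]


def matches_with_errors(word, pattern, threshold):
    """A word matches the pattern if it is within `threshold` edits of it."""
    return levenshtein(word, pattern) <= threshold
```

For the pair on the slide, `levenshtein("historical", "hysterical")` is 2 (two replacements), so the words match with a threshold of 2 but not with a threshold of 1.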

...Pattern matching
...Types
Regular expressions
o union = or: if e1, e2 are expressions, so is (e1 | e2)
o concatenation: e1 e2
o repetition: e* (0 or more occurrences)
Extended patterns
o user-friendly; can be internally converted into simple ones
o case-insensitive, “anything” (wildcard), digit, vowel, ...
o conditionals, optional parts
o some parts match exactly and others with errors
o etc.
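
The three basic operations map directly onto Python's `re` module, and an extended pattern can be translated into a plain regular expression before matching. The extended syntax below ('*', '#', '@') is invented for this sketch and does not belong to any real query language.

```python
import re

# The three basic operations, written in Python's `re` syntax.
union = re.compile(r"syntax|syntactic")        # e1 | e2
concatenation = re.compile(r"pro(blem|tein)")  # e1 e2: "pro" followed by a union
repetition = re.compile(r"ab*")                # e*: "a" followed by 0 or more "b"


def extended_to_regex(pattern, ignore_case=True):
    """Translate a toy extended pattern into a plain regular expression.

    Hypothetical syntax: '*' = anything, '#' = one digit, '@' = one vowel;
    every other character matches itself.
    """
    special = {"*": ".*", "#": r"\d", "@": "[aeiou]"}
    body = "".join(special.get(ch, re.escape(ch)) for ch in pattern)
    return re.compile(body, re.IGNORECASE if ignore_case else 0)
```

For example, `extended_to_regex("comput*").fullmatch("Computers")` succeeds: the prefix query "comput-" becomes the simple expression `comput.*`, matched case-insensitively.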

Structural queries
Old days: fields. No nesting, no overlap, fixed order.
o Email: subject, body, sender, ...
o = Relational database with a text type, treated as text should be
o Versions of SQL with text operators
Hypertext
o Not well developed. Too free
o WebGlimpse: search the neighborhood
Hierarchical
o Intermediate level of freedom
o Volumes, chapters, sections, paragraphs, sentences, ...

[Figure: the three types of structure compared: fields (too fixed), hypertext (too free), hierarchical (intermediate)]

Hierarchical Models ...
PAT expressions
o Hierarchy is defined at query time
o Regions are included in the index, e.g., sections, italics, ...
o Regions of different types can overlap; regions of the same type can’t
o Can query for words in a region, regions in a region, etc.
o Complex computation, unclear semantics
Overlapped lists
o Evolution of PAT: areas of the same type can overlap (but not nest)
o Uses the same inverted file
o Can combine regions, specify order, ...
o n-words: all (overlapping) areas of n words
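
The region queries mentioned above can be sketched by treating each region as a pair of word positions (start, end), both inclusive. This only illustrates the inclusion operations; the function names are made up, and the quadratic scans stand in for the index-based algorithms a real system would use.

```python
def words_in_regions(positions, regions):
    """Keep the word positions that fall inside at least one region."""
    return [p for p in positions if any(s <= p <= e for s, e in regions)]


def regions_containing(regions, positions):
    """Keep the regions that contain at least one of the word positions."""
    return [(s, e) for s, e in regions if any(s <= p <= e for p in positions)]


def regions_in_regions(inner, outer):
    """Keep the inner regions fully included in some outer region."""
    return [(s, e) for s, e in inner
            if any(o_start <= s and e <= o_end for o_start, o_end in outer)]


def n_words(length, n):
    """All (overlapping) areas of n consecutive words in a text of `length` words."""
    return [(i, i + n - 1) for i in range(length - n + 1)]
```

With sections `[(0, 9), (10, 19)]` and italics `[(3, 5), (8, 12)]`, the italic region (8, 12) overlaps both sections without being included in either, so `regions_in_regions` returns only (3, 5).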

Overlapping lists

... Hierarchical Models ...
List of references
o Answers are references (pointers) to regions
o Only one type of region (e.g., only sections). No nesting.
o Known at index time
o Ancestry of nodes. Can query paths
Proximal nodes
o Compromise between expressiveness and efficiency
o Many (overlapping) fixed hierarchies
o Interesting queries: “3rd paragraph of each chapter”, ...
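
To make a query like “3rd paragraph of each chapter” concrete, here is a sketch over a single fixed hierarchy. It is a plain tree walk, not an implementation of the proximal nodes model itself; the `Node` class and both functions are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # e.g. "book", "chapter", "paragraph"
    text: str = ""
    children: list = field(default_factory=list)


def descendants(node, kind):
    """Yield all descendants of `node` with the given kind, in document order."""
    for child in node.children:
        if child.kind == kind:
            yield child
        yield from descendants(child, kind)


def nth_of_each(root, outer, inner, n):
    """The n-th `inner` node (1-based) inside each `outer` node, where it exists."""
    result = []
    for region in descendants(root, outer):
        inside = list(descendants(region, inner))
        if len(inside) >= n:
            result.append(inside[n - 1])
    return result
```

Calling `nth_of_each(book, "chapter", "paragraph", 3)` returns one paragraph node per chapter that has at least three paragraphs; chapters with fewer are silently skipped.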

Proximal nodes

... Hierarchical Models
Tree matching
o Query is a tree. Match it against the text tree.
o Ordered or unordered trees (are siblings ordered?)
o Prolog-like constraints on different parts of the tree: variables
o Answer: root of a match
o Very inefficient (usually NP-hard), due to variables and unordered matching
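
The simplest case, ordered matching without variables, can be sketched in a few lines. Trees are written as `(label, [children])` tuples; this representation and the function names are assumptions of the example, and the hard cases (variables, unordered siblings) are deliberately left out.

```python
def matches_at(pattern, node):
    """True if `pattern` matches at `node`.

    The labels must be equal, and the pattern's children must match some of
    the node's children in the same left-to-right order (extra children in
    the text tree are allowed).
    """
    p_label, p_children = pattern
    n_label, n_children = node
    if p_label != n_label:
        return False
    i = 0
    for p_child in p_children:
        # Greedy scan: take the first remaining child that matches.
        while i < len(n_children) and not matches_at(p_child, n_children[i]):
            i += 1
        if i == len(n_children):
            return False
        i += 1
    return True


def find_matches(pattern, root):
    """Return the roots of all matches, which are the answer to the query."""
    found = [root] if matches_at(pattern, root) else []
    for child in root[1]:
        found.extend(find_matches(pattern, child))
    return found
```

The greedy scan is enough here only because siblings are ordered; once sibling order is free or variables must bind consistently across the tree, the search has to try many assignments.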

Research issues in hierarchical models
Static or dynamic?
o Define the hierarchy at index time or at query time?
o Static: text markup. Dynamic: tags, indexed.
Restrictions on the structure
o Restrict the structure or restrict the query language
o For efficiency
Integration with text
o What is of secondary importance: structure (in IR) or text (in DB)?
o Combine
Query language
o Standardization, expressiveness taxonomy, categorization

Query protocols
Used internally
Standard: one client can query different libraries
o On CD-ROMs: disk interchangeability
Z39.50: bibliographic (used for other types, too)
WAIS (Wide Area Information Service)
o Includes Z39.50
For CD-ROMs:
o CCL, Common Command Language
o CD-RDx (Compact Disk Read only Data Exchange)
o SFQL (Structured Full-text Query Language). Like DB.

Types of queries we have discussed

Trends and research topics
Models: to better understand the user needs
Query languages: flexibility, power, expressiveness, functionality
Visual languages
o Example: library shown on the screen. Actions: take books, open catalogs, etc.
o Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!

Conclusions
Breadth-wise:
o words, phrases, proximity, fuzzy Boolean, natural language
Depth-wise:
o Pattern matching
If they return sets, they can be combined using the Boolean model
Combining with structure
o Hierarchical structure
Standardized low-level languages: protocols
o Reusable
Thank you!
Till October 16
October 23: midterm exam