Special Topics in Computer Science
The Art of Information Retrieval
Chapter 7: Text Operations
Alexander Gelbukh
www.Gelbukh.com

Previous chapter: Conclusions
Modeling of text helps predict the behavior of systems
o Zipf’s law, Heaps’ law
Describing the structure of documents formally allows part of their meaning to be processed automatically, e.g., in search
Languages to describe document syntax
o SGML, too expensive
o HTML, too simple
o XML, good combination

Text operations
Linguistic operations
Document clustering
Compression
Encryption (not discussed here)

Linguistic operations
Purpose: convert words to “meanings”
Synonyms or related words
o Different words, same meaning. Morphology
o Foot / feet, woman / female
Homonyms
o Same words, different meanings. Word senses
o River bank / financial bank
Stopwords
o Word, no meaning. Functional words
o The

For good or for bad?
More exact matching
o Less noise, better recall
Unexpected behavior
o Difficult for users to grasp
o Harmful if it introduces errors
More expensive
o Adds a whole new technology
o Maintenance; language-dependent
o Slows down processing
Good if done well, harmful if done badly

Document preprocessing
Lexical analysis (punctuation, case)
o Simple, but must be done carefully
Stopwords. Removing them reduces index size and processing time
Stemming: connected, connection, connections, ...
o Multiword expressions: hot dog, B-52
o Here, all the power of linguistic analysis can be used
Selection of index terms
o Often nouns; noun groups: computer science
Construction of a thesaurus
o Synonymy: network of related concepts (words or phrases)
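The first preprocessing steps on this slide (lexical analysis, multiword expressions, stopword removal) can be sketched roughly as follows. This is only an illustration: the stopword list, the multiword table, and the function names are assumptions made up for the example, not taken from the book or the lecture.

```python
import re

# Hypothetical, tiny stopword list; a real system would use a much longer one.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

# Hypothetical table of multiword expressions kept as single index terms.
MULTIWORDS = {("hot", "dog"): "hot_dog"}


def tokenize(text):
    """Lexical analysis: lowercase the text and drop punctuation,
    but keep hyphens inside a token so that 'B-52' survives as 'b-52'."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def join_multiwords(tokens):
    """Replace known two-word expressions with a single token."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORDS:
            out.append(MULTIWORDS[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def preprocess(text):
    """Tokenize, join multiword expressions, then remove stopwords."""
    tokens = join_multiwords(tokenize(text))
    return [t for t in tokens if t not in STOPWORDS]
```

For example, `preprocess("The B-52 is a bomber; the hot dog is food.")` keeps only `b-52`, `bomber`, `hot_dog` and `food`. Even this small case shows why lexical analysis must be done carefully: a naive split on punctuation would break B-52 into two tokens.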

Stemming
Methods
o Linguistic analysis: complex, expensive maintenance
o Table lookup: simple, but needs data
o Statistical (Avetisyan): no data, but imprecise
o Suffix removal
Suffix removal
o Porter algorithm (Martin Porter). Ready-made code on his website
o Substitution rules: sses → ss, s → (nothing)
o Example: stresses → stress
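The substitution rules above correspond to the first step (step 1a) of the published Porter algorithm. A minimal sketch of that single step follows. The slide only names the sses and s rules; the ies and ss rules are added here from the published algorithm, and the function name is an assumption. This is not the full stemmer, which has several further steps.

```python
def porter_step_1a(word):
    """Apply the first matching plural rule, trying longer suffixes first.

    Rules (step 1a of the Porter algorithm):
        sses -> ss, ies -> i, ss -> ss, s -> (nothing)
    """
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule applies


if __name__ == "__main__":
    for w in ["stresses", "ponies", "caress", "cats", "connect"]:
        print(w, "->", porter_step_1a(w))
```

The order of the rules matters: only the first (longest) matching suffix is rewritten, which is why "caress" is left alone by the ss rule instead of losing its final s.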

Better stemming
The whole range of problems of computational linguistics
POS disambiguation
o well: adverb or noun? Oil well.
o Statistical methods. Brill tagger
o Syntactic analysis. Syntactic disambiguation
Word sense disambiguation
o bank1 and bank2 should be different stems
o Statistical methods
o Dictionary-based methods. Lesk algorithm
o Semantic analysis
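As an illustration of the dictionary-based approach, here is a sketch of a simplified variant of the Lesk algorithm: choose the sense whose dictionary definition shares the most words with the context. The two glosses, the small stopword set, and all names below are invented for the example and do not come from a real dictionary.

```python
# Hypothetical mini-dictionary: sense label -> gloss (invented for this sketch).
SENSES = {
    "bank1": "sloping land beside a river or lake",
    "bank2": "financial institution that accepts deposits and lends money",
}

# Hypothetical stopword set, so that function words do not count as overlap.
STOP = {"a", "or", "that", "and", "the", "of", "on", "he", "his", "at", "in"}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOP}


def simplified_lesk(context, senses=SENSES):
    """Return the sense whose gloss overlaps most with the context words.
    Ties are resolved in favor of the sense listed first."""
    ctx = content_words(context)
    return max(senses, key=lambda s: len(ctx & content_words(senses[s])))
```

With these glosses, "he sat on the bank of the river" is assigned bank1 (overlap on "river"), while "the bank lends money at low rates" is assigned bank2 (overlap on "lends" and "money"). The two occurrences can then be indexed as different stems.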

Thesaurus
Terms (controlled vocabulary) and relationships
Terms
o used for indexing
o represent a concept. One word or a phrase. Usually nouns
o sense. Definition or notes to distinguish senses: key (door).
Relationships
o Paradigmatic: synonymy, hierarchical (is-a, part), non-hierarchical
o Syntagmatic: collocations, co-occurrences
WordNet. EuroWordNet
o synsets

Use of thesaurus
To help the user formulate the query
o Navigation in the hierarchy of words
o Yahoo!
For the program, to collate related terms
o woman ≈ female
o Fuzzy comparison: woman ≈ 0.8 * female. Based on path length
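One way to read the fuzzy comparison is as a weight that decays with the path length between two terms in the thesaurus. The sketch below assumes a decay of 0.8 per edge, so that directly linked terms get weight 0.8 as in the woman / female example. The toy graph, the decay formula, and the function names are assumptions for illustration, not the method of the book or of any particular thesaurus.

```python
from collections import deque

# Hypothetical toy thesaurus: undirected links between related terms.
EDGES = [
    ("woman", "female"),
    ("female", "person"),
    ("man", "male"),
    ("male", "person"),
]


def build_graph(edges):
    """Build an adjacency map from a list of undirected edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_length(graph, start, goal):
    """Number of edges on the shortest path (breadth-first search),
    or None if the terms are not connected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def similarity(graph, a, b, decay=0.8):
    """Assumed weighting: decay ** path_length, and 0.0 if unconnected."""
    d = path_length(graph, a, b)
    return 0.0 if d is None else decay ** d
```

With this toy graph, `similarity(build_graph(EDGES), "woman", "female")` is 0.8, while woman and man are four edges apart and get a much lower weight. A query term can then match a related document term with a reduced score instead of not matching at all.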

Yahoo! vs. thesaurus
The book says Yahoo! is based on a thesaurus.
I disagree
Thesaurus: words of the language organized in a hierarchy
Document hierarchy: documents attached to a hierarchy
This is word sense disambiguation
I claim that Yahoo! is based on (manual) WSD
It also uses a thesaurus for navigation

Document clustering
Operation on the whole collection
Global vs. local
Global: whole collection
o At compile time, one-time operation
Local
o Cluster the results of a specific query
o At runtime, with each query
Is more of a query transformation operation
o Already discussed in Chapter 5

Compression
Gain: storage, transmission, search
Loss: time spent compressing/decompressing
In IR: need for random access.
o Blocks do not work
Also: pattern matching on compressed text

Compression methods
Statistical
Huffman: a fixed code per symbol.
o More frequent symbols get shorter codes
o Allows starting decompression from any symbol
Arithmetic: dynamic coding
o Need to decompress from the beginning
o Not for IR
Dictionary
Pointers to previous occurrences. Lempel-Ziv
o Again not for IR

Compression ratio
Compressed size / uncompressed size
Huffman, units = words: up to 2 bits per character
o Close to the limit = entropy. Only for large texts!
o Other methods: similar ratio, but no random access
Shannon: the optimal code length for a symbol with probability p is -log2 p
Entropy: limit of compression
o Average length with optimal coding
o Property of the model
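Averaging the optimal length -log2 p over all symbols gives the entropy, H = -(sum over symbols of p * log2 p), in bits per symbol. The sketch below estimates it under the simplest possible model, in which each symbol's probability is its relative frequency in the text itself. That model and the function name are assumptions of the example; a different model gives a different entropy.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(symbols):
    """Entropy of a zero-order model whose probabilities are the
    relative frequencies of the symbols in the given sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    text = "to be or not to be"
    # Character-based model: symbols are characters (including spaces).
    print(entropy_bits_per_symbol(text))
    # Word-based model: symbols are words.
    print(entropy_bits_per_symbol(text.split()))
```

The same function works for characters and for words because it only counts symbols. This is the sense in which entropy is a property of the model and not of the text alone.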

Modeling
Find the probability of the next symbol
Adaptive, static, semi-static
o Adaptive: good compression, but need to start from the beginning
o Static (for language): poor compression, random access
o Semi-static (for specific text; two-pass): both OK
Word-based vs. character-based
o Word-based: better compression and search

Huffman coding
Each symbol is encoded, sequentially
More frequent symbols have shorter codes
No code is a prefix of another one
How to build the tree: see the book
Byte codes are better
Allow for sequential search
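The slide leaves the tree construction to the book. As a rough illustration of the three properties listed above, here is a sketch of the usual greedy construction (repeatedly merge the two least frequent subtrees) for a bit-oriented, word-based, semi-static code. It does not implement the byte-oriented variant mentioned on the slide, and the function names are assumptions.

```python
import heapq
from collections import Counter


def huffman_codes(symbols):
    """Return {symbol: bit string} built from the symbol frequencies."""
    counts = Counter(symbols)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, unique tie-breaker, {symbol: code so far}).
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # least frequent subtree
        f2, _, right = heapq.heappop(heap)  # second least frequent subtree
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, codes):
    """Encode each symbol sequentially with its fixed code."""
    return "".join(codes[s] for s in symbols)


def decode(bits, codes):
    """Decode left to right; works because no code is a prefix of another."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out
```

Used with words as symbols, for example `codes = huffman_codes(text.split())`, frequent words such as "to" and "be" get codes no longer than those of rare words. Because every word always has the same code, decoding can start at any code boundary.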

Dictionary-based methods
Static (simple, poor compression), dynamic, semi-static.
Lempel-Ziv: references to previous occurrences
o Adaptive
Disadvantages for IR
o Need to decode from the very beginning
o New statistical methods perform better

Comparison of methods

Compression of inverted files
Inverted file: words + lists of docs where they occur
Lists of docs are ordered. Can be compressed
Seen as lists of gaps.
o Short gaps occur more frequently
o Statistical compression
Our work: order the docs for better compression
o We code runs of docs
o Minimize the number of runs
o Distance: # of different words
o Traveling salesman problem (TSP).
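The gap idea can be sketched as follows: store the differences between consecutive document numbers, then code small numbers with fewer bytes. The example uses variable-byte coding as a simple stand-in for the statistical coders mentioned above. That choice is an assumption of the sketch, and the code does not attempt the run coding or document reordering described under "our work".

```python
def to_gaps(doc_ids):
    """Turn a sorted list of positive doc ids into gaps from the previous id."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: running sums give back the doc ids."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Variable-byte code: 7 data bits per byte, most significant group
    first; the high bit is set only on the last byte of each number."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80  # mark the lowest group as the terminating byte
        out.extend(reversed(chunk))
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:  # terminating byte reached
            numbers.append(n)
            n = 0
    return numbers
```

For the list 3, 7, 8, 15, 300 the gaps are 3, 4, 1, 7, 285, which take six bytes in this coding. Short gaps fit in one byte each, which is why an ordering of documents that produces many short gaps compresses better.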

Research topics
All computational linguistics
o Improved POS tagging
o Improved WSD
Uses of thesaurus
o for user navigation
o for collating similar terms
Better compression methods
o Searchable compression
o Random access

Conclusions
Text transformation: meaning instead of strings
o Lexical analysis
o Stopwords
o Stemming: POS, WSD, syntax, semantics; ontologies to collate similar stems
Text compression
o Searchable
o Random access
o Word-based statistical methods (Huffman)
Index compression

Thank you!
Till compensation lecture