Special Topics in Computer Science: The Art of Information Retrieval. Chapter 7: Text Operations. Alexander Gelbukh

Special Topics in Computer Science

The Art of Information Retrieval

Chapter 7: Text Operations

Alexander Gelbukh

www.Gelbukh.com

Previous chapter: Conclusions

Modeling of text helps predict the behavior of systems

o Zipf's law, Heaps' law

Describing the structure of documents formally makes it possible to process part of their meaning automatically, e.g., in search

Languages to describe document syntax

o SGML, too expensive

o HTML, too simple

o XML, good combination

Text operations

Linguistic operations

Document clustering

Compression

Encryption (not discussed here)

Linguistic operations

Purpose: convert words to “meanings”

Synonyms or related words

o Different words, same meaning. Morphology

o Foot / feet, woman / female

Homonyms

o Same word, different meanings. Word senses

o River bank / financial bank

Stopwords

o Words with no meaning of their own. Function words

o The

For good or for bad?

More exact matching

o Less noise, better recall

Unexpected behavior

o Difficult for users to grasp

o Harmful if it introduces errors

More expensive

o Adds a whole new technology

o Maintenance; language-dependent

o Slows down

Good if done well, harmful if done badly

Document preprocessing

Lexical analysis (punctuation, case)

o Simple, but must be done carefully

Stopwords. Removing them reduces index size and processing time

Stemming: connected, connection, connections, ...

o Multiword expressions: hot dog, B-52

o Here, all the power of linguistic analysis can be used

Selection of index terms

o Often nouns; noun groups: computer science

Construction of thesaurus

o synonymy: network of related concepts (words or phrases)
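
The first two preprocessing steps (lexical analysis and stopword removal) can be sketched as follows. This is only a minimal illustration, not the procedure from the book: the stopword list is a tiny made-up sample, and the tokenization rule (keep internal hyphens so that B-52 stays one token) is an assumption chosen to show why lexical analysis needs care.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def preprocess(text):
    """Lowercase the text, split it into tokens, and drop stopwords."""
    # Letters and digits form tokens; an internal hyphen is kept,
    # so "B-52" survives as the single token "b-52".
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The B-52 is a bomber; the hot dog is a food."))
```

Note that "hot dog" still comes out as two separate tokens: recognizing multiword expressions would need a lexicon or further linguistic analysis, which this sketch does not attempt.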

Stemming

Methods

o Linguistic analysis: complex, expensive maintenance

o Table lookup: simple, but needs data

o Statistical (Avetisyan): no data, but imprecise

o Suffix removal

Suffix removal

o Porter algorithm. Martin Porter. Ready code on his website

o Substitution rules: sses → ss, s → ∅ (empty)

o stresses → stress.
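
The substitution rules above can be sketched in a few lines. The example below covers only the plural-stripping rules (step 1a of Porter's published algorithm); the rules ies → i and ss → ss are added from that step so that a word like "stress" is not cut down to "stres". The function name `strip_plural` is hypothetical, and this is far from a complete stemmer.

```python
# Step 1a of the Porter algorithm as (suffix, replacement) pairs.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def strip_plural(word):
    """Apply the rule with the longest matching suffix; leave other words alone."""
    for suffix, replacement in sorted(RULES, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["stresses", "connections", "stress", "ponies"]:
    print(w, "->", strip_plural(w))
```

Trying the longest suffix first is what makes "stresses" become "stress" rather than "stresse".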

Better stemming

The whole range of problems of computational linguistics

POS disambiguation

o well: adverb or noun? Oil well.

o Statistical methods. Brill tagger

o Syntactic analysis. Syntactic disambiguation

Word sense disambiguation

o bank1 and bank2 should be different stems

o Statistical methods

o Dictionary-based methods. Lesk algorithm

o Semantic analysis
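
As an illustration of the dictionary-based approach, here is a minimal sketch of a simplified Lesk-style disambiguator: it picks the sense whose dictionary gloss shares the most words with the context. The two glosses for "bank" are made up for this example (they are not taken from any real dictionary), and a practical version would at least remove stopwords and stem the words first.

```python
# Hypothetical two-sense inventory for "bank"; glosses are invented for illustration.
SENSES = {
    "bank1": "sloping land beside a river or lake",
    "bank2": "financial institution that accepts deposits and lends money",
}

def simplified_lesk(context, senses):
    """Return the sense whose gloss has the largest word overlap with the context."""
    context_words = set(context.lower().split())

    def overlap(sense):
        return len(context_words & set(senses[sense].lower().split()))

    # On a tie, max() returns the first sense in dictionary order.
    return max(senses, key=overlap)

print(simplified_lesk("he sat on the bank of the river", SENSES))
print(simplified_lesk("the bank lends money to customers", SENSES))
```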

Thesaurus

Terms (controlled vocabulary) and relationships

Terms

o used for indexing

o represent a concept. One word or a phrase. Usually nouns

o sense. Definition or notes to distinguish senses: key (door).

Relationships

o Paradigmatic: synonymy, hierarchical (is-a, part), non-hierarchical

o Syntagmatic: collocations, co-occurrences

WordNet. EuroWordNet

o synsets

Use of thesaurus

To help the user to formulate the query

o Navigation in the hierarchy of words

o Yahoo!

For the program, to collate related terms

o woman ≈ female

o fuzzy comparison: woman ≈ 0.8 * female. Path length
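
One way to read the "path length" idea is to score two terms by how many hierarchy links separate them. The sketch below assumes a toy is-a hierarchy (the `PARENT` table is invented) and an exponential decay of 0.8 per link, chosen only so that directly linked terms score 0.8 as in the example above; the slides do not specify the actual formula.

```python
# Toy is-a hierarchy (child -> parent); invented for illustration.
PARENT = {"woman": "female", "girl": "female", "female": "person",
          "man": "male", "male": "person"}

def ancestors(term, parent):
    """Return the chain term, parent, grandparent, ... up to the root."""
    chain = [term]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def path_length(a, b, parent):
    """Number of links between a and b through their lowest common ancestor."""
    up_a, up_b = ancestors(a, parent), ancestors(b, parent)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return None  # the terms are in unrelated trees

def similarity(a, b, parent, decay=0.8):
    """Fuzzy match weight: 1.0 for identical terms, multiplied by decay per link."""
    distance = path_length(a, b, parent)
    return 0.0 if distance is None else decay ** distance

print(similarity("woman", "female", PARENT))
print(path_length("woman", "man", PARENT))
```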

Yahoo! vs. thesaurus

The book says Yahoo! is based on a thesaurus.

I disagree

Thesaurus: words of the language organized in a hierarchy

Document hierarchy: documents attached to the hierarchy

This is word sense disambiguation

I claim that Yahoo! is based on (manual) WSD

It also uses a thesaurus for navigation

Document clustering

Operation on the whole collection

Global vs. local

Global: whole collection

o At compile time, one-time operation

Local

o Cluster the results of a specific query

o At runtime, with each query

Local clustering is more of a query transformation operation

o Already discussed in Chapter 5

Compression

Gain: storage, transmission, search

Loss: time spent compressing/decompressing

In IR: need for random access.

o Blocks do not work

Also: pattern matching on compressed text

Compression methods

Statistical

Huffman: a fixed code for each symbol

o More frequent symbols get shorter codes

o Allows starting decompression from any symbol

Arithmetic: dynamic coding

o Need to decompress from the beginning

o Not for IR

Dictionary

Pointers to previous occurrences. Lempel-Ziv

o Again not for IR

Compression ratio

Compressed size / uncompressed size

Huffman, units = words: down to about 2 bits per char

o Close to the limit = entropy. Only for large texts!

o Other methods: similar ratio, but no random access

Shannon: optimal code length for a symbol with probability p is -log2 p

Entropy: limit of compression

o Average length with optimal coding

o Property of model
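
Averaging Shannon's optimal length over all symbols gives the entropy, H = -Σ p_i log2 p_i. The sketch below estimates it under the simplest possible model, assumed here for illustration: each symbol's probability is its relative frequency in the text itself, with no context. A different model would give a different entropy for the same text, which is the sense in which entropy is a property of the model.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(symbols):
    """Zero-order entropy: average optimal code length in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

words = "to be or not to be".split()
print(entropy_bits_per_symbol(words))      # symbols = words
print(entropy_bits_per_symbol("abracadabra"))  # symbols = characters
```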

Modeling

Find probability for the next symbol

Adaptive, static, semi-static

o Adaptive: good compression, but need to start from the beginning

o Static (for language): poor compression, random access

o Semi-static (for specific text; two-pass): both OK

Word-based vs. character-based

o Word-based: better compression and search

Huffman coding

Each symbol is encoded, sequentially

More frequent symbols have shorter codes

No code is a prefix of another one

How to build the tree: book

Byte codes are better

o Allow for sequential search
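
For readers without the book at hand, the tree construction can be sketched as follows: repeatedly merge the two least frequent subtrees, prefixing 0 to the codes on one side and 1 on the other. This is a plain bit-oriented, semi-static version (first pass counts frequencies, codes are then fixed); it is not the byte-oriented variant that the slide recommends, and the function name `huffman_codes` is hypothetical.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Return a dict mapping each symbol to its Huffman bit string."""
    counts = Counter(symbols)
    if not counts:
        return {}
    if len(counts) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # least frequent subtree
        w2, _, right = heapq.heappop(heap)  # second least frequent subtree
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

codes = huffman_codes("a a a a b b c".split())
print(codes)
```

With words as symbols, the same function yields the word-based coding discussed above: pass it the list of words of the text instead of single letters.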

Dictionary-based methods

Static (simple, poor compression), dynamic, semi-static.

Lempel-Ziv: references to previous occurrence

o Adaptive

Disadvantages for IR

o Need to decode from the very beginning

o New statistical methods perform better
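
To see why such methods force decoding from the very beginning, consider this minimal sketch in the style of LZ77 (one member of the Lempel-Ziv family, chosen here as an assumption since the slides do not say which variant is meant). Each output triple points back into text that has already been decoded, so no triple can be interpreted without everything before it. The search is deliberately naive and the window size is an arbitrary illustrative value.

```python
def lz77_compress(text, window=255):
    """Encode text as (offset, length, next_char) triples."""
    i, out = 0, []
    while i < len(text):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Extend the match, keeping one character in reserve for next_char.
            while i + k < len(text) - 1 and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(triples):
    """Rebuild the text; every copy reads from the already decoded prefix."""
    chars = []
    for offset, length, next_char in triples:
        for _ in range(length):
            chars.append(chars[-offset])
        chars.append(next_char)
    return "".join(chars)

triples = lz77_compress("abracadabra abracadabra")
print(triples)
print(lz77_decompress(triples))
```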

Comparison of methods

Compression of inverted files

Inverted file: words + lists of docs where they occur

Lists of docs are ordered. Can be compressed

Seen as lists of gaps.

o Short gaps occur more frequently

o Statistical compression

Our work: order the docs for better compression

o We code runs of docs

o Minimize the number of runs

o Distance: # of different words

o TSP.
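
The gap idea from the first half of this slide can be sketched as follows: store each sorted list of document numbers as differences between neighbours, then give small numbers short codes. The Elias gamma code used here is one common choice for this and is an assumption, not necessarily the method the slides have in mind; it requires positive integers, so document numbers are assumed to start at 1 and to be strictly increasing.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Sorted doc ids -> first id followed by differences between neighbours."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def elias_gamma(n):
    """Gamma code of n >= 1: (bit length - 1) zeros, then n in binary."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def decode_gamma_stream(bits):
    """Decode a concatenation of gamma codes back into integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

doc_ids = [3, 5, 6, 10, 11]
bits = "".join(elias_gamma(g) for g in to_gaps(doc_ids))
print(bits)
print(list(accumulate(decode_gamma_stream(bits))))
```

A gap of 1 costs a single bit, which is why reordering documents so that lists contain long runs of consecutive numbers improves compression.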

Research topics

All computational linguistics

o Improved POS tagging

o Improved WSD

Uses of thesaurus

o for user navigation

o for collating similar terms

Better compression methods

o Searchable compression

o Random access

Conclusions

Text transformation: meaning instead of strings

o Lexical analysis

o Stopwords

o Stemming: POS, WSD, syntax, semantics

o Ontologies to collate similar stems

Text compression

o Searchable

o Random access

o Word-based statistical methods (Huffman)

Index compression

Thank you!

Till compensation lecture
