Text operations
Prepared by: Loay Alayadhi 425121605
Supervised by: Dr. Mourad Ykhlef
Post on 22-Dec-2015
Document Preprocessing
• Lexical analysis
• Elimination of stopwords
• Stemming of the remaining words
• Selection of index terms
• Construction of term categorization structures (thesaurus)
Logical View of a Document
[Figure: logical view of a document. A document passes through structure recognition (yielding text + structure), then accent and spacing normalization, stopword removal, noun group detection, stemming, and automatic or manual indexing. The resulting views range from structure and full text to index terms.]
1) Lexical Analysis of the Text
• Lexical analysis: convert an input stream of characters into a stream of words.
– The major objective is the identification of the words in the text. How?
• Digits: ignoring numbers is a common approach.
• Hyphens: e.g., state-of-the-art.
• Punctuation marks: remove them. Exception: 510B.C.
• Case: letters are usually converted to a single case.
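The lexical-analysis steps above can be sketched as a small tokenizer. This is only a minimal illustration: the function name `tokenize` and the `keep_digits` flag are assumptions, and the sketch does not handle exceptions such as "510B.C.", which a real lexer would need to treat specially.

```python
import re

def tokenize(text, keep_digits=False):
    """Split text into lowercase word tokens (illustrative sketch)."""
    text = text.lower()                      # case folding
    text = text.replace("-", " ")            # break hyphenated words apart
    tokens = re.findall(r"[a-z0-9]+", text)  # punctuation is dropped here
    if not keep_digits:
        # Ignore tokens made only of digits, as the slide suggests.
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

tokens = tokenize("State-of-the-art retrieval, since 1999!")
print(tokens)
```

Breaking hyphenated words apart is one possible policy; keeping them as a single token is equally defensible and depends on the collection.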
2) Elimination of Stopwords
• Words that appear too often are not useful for IR.
• Stopwords: words that appear in more than 80% of the documents in the collection are treated as stopwords and filtered out as potential index terms.
• Problem
– How do we search for “to be or not to be”?
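A minimal sketch of the 80% rule, assuming each document is already a list of tokens. The helper names `find_stopwords` and `remove_stopwords` and the toy collection are hypothetical.

```python
def find_stopwords(documents, threshold=0.8):
    """Return words whose document frequency exceeds `threshold`."""
    doc_count = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w for w, df in doc_freq.items() if df / doc_count > threshold}

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stopwords."""
    return [t for t in tokens if t not in stopwords]

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "bird", "flew"],
    ["the", "fish", "swam"],
    ["the", "cat", "ran"],
]
stopwords = find_stopwords(docs)
print(stopwords)
print(remove_stopwords(["the", "cat", "ran"], stopwords))
```

The problem on the slide shows up directly: with a stoplist containing "to", "be", "or", and "not", the query “to be or not to be” is filtered down to an empty list.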
3) Stemming
• A stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes or suffixes).
• Example
– connect, connected, connecting, connection, connections
• Removing strategies
– affix removal: intuitive, simple
– table lookup
– successor variety
– n-gram
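Affix removal can be illustrated with a toy suffix stripper. The suffix list and the `min_stem_len` guard below are assumptions chosen to cover the slide's example; this is not a full stemming algorithm such as Porter's.

```python
# Hypothetical, illustrative suffix list; longer suffixes are tried first.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

def strip_suffix(word, min_stem_len=3):
    """Remove the first matching suffix, keeping at least `min_stem_len` letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["connect", "connected", "connecting", "connection", "connections"]
stems = [strip_suffix(w) for w in words]
print(stems)
```

All five variants reduce to the same stem, so a query on any one of them can match documents containing the others.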
4) Index Term Selection
• Motivation
– A sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.
– Most of the semantics is carried by the nouns.
• Identification of noun groups
– A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold.
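A rough sketch of noun grouping, under two stated assumptions: the tokens are already tagged with parts of speech (the tags here are supplied by hand), and syntactic distance is measured as the number of words between two consecutive nouns. The function name `noun_groups` and the threshold value are hypothetical.

```python
def noun_groups(tagged_tokens, max_distance=2):
    """Group nouns separated by at most `max_distance` intervening words."""
    groups, current, last_noun_pos = [], [], None
    for pos, (word, tag) in enumerate(tagged_tokens):
        if tag != "NOUN":
            continue
        # Start a new group when too many words separate this noun from the last.
        if last_noun_pos is not None and pos - last_noun_pos - 1 > max_distance:
            groups.append(current)
            current = []
        current.append(word)
        last_noun_pos = pos
    if current:
        groups.append(current)
    return groups

tagged = [
    ("computer", "NOUN"), ("science", "NOUN"), ("is", "VERB"), ("the", "DET"),
    ("study", "NOUN"), ("of", "ADP"), ("very", "ADV"), ("fast", "ADJ"),
    ("and", "CONJ"), ("clever", "ADJ"), ("algorithms", "NOUN"),
]
groups = noun_groups(tagged)
print(groups)
```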
5) Thesaurus Construction
• Thesaurus: a precompiled list of important words in a given domain of knowledge; for each word in this list, there is a set of related words.
• A controlled vocabulary for indexing and searching. Why?
– Normalization
– Indexing concepts
– Reduction of noise
– Identification, etc.
The Purpose of a Thesaurus
• To provide a standard vocabulary for indexing and searching
• To assist users with locating terms for proper query formulation
• To provide classified hierarchies that allow the broadening and narrowing of the current query request
Thesaurus (cont.)
• Not like a common dictionary
– Words with their explanations
• May contain the words of a whole language
• Or may contain only the words of a specific domain
• With a lot of other information, especially the relationships between words
– Classification of words in the language
– Word relationships such as synonyms and antonyms
Roget thesaurus example
cowardly, adjective (Arabic: جبان)
Ignobly lacking in courage: cowardly turncoats
Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang)
• http://www.thesaurus.com
• http://www.dictionary.com/
Thesaurus Term Relationships
• BT: broader term
• NT: narrower term
• RT: related term (non-hierarchical, but related)
Use of Thesaurus
• Indexing: select the most appropriate thesaurus entries for representing the document.
• Searching: design the most appropriate search strategy.
– If the search does not retrieve enough documents, the thesaurus can be used to expand the query.
– If the search retrieves too many items, the thesaurus can suggest a more specific search vocabulary.
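Both search uses can be sketched with a tiny thesaurus stored as a dictionary of BT/NT/RT lists. The entries, the helper names `expand_query` and `narrow_query`, and the choice to expand with broader and related terms are all illustrative assumptions.

```python
# Hypothetical one-entry thesaurus using the BT / NT / RT relationships.
THESAURUS = {
    "car": {"BT": ["vehicle"], "NT": ["sedan", "hatchback"], "RT": ["road"]},
}

def expand_query(terms, thesaurus, relations=("BT", "RT")):
    """Too few results: add broader and related terms to the query."""
    expanded = list(terms)
    for term in terms:
        entry = thesaurus.get(term, {})
        for rel in relations:
            for related in entry.get(rel, []):
                if related not in expanded:
                    expanded.append(related)
    return expanded

def narrow_query(terms, thesaurus):
    """Too many results: suggest narrower terms for each query term."""
    return {t: thesaurus.get(t, {}).get("NT", []) for t in terms}

print(expand_query(["car", "rental"], THESAURUS))
print(narrow_query(["car"], THESAURUS))
```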
Document clustering
• Document clustering: the operation of grouping together similar documents in classes
• Global vs. local
• Global: whole collection
– At compile time, a one-time operation
• Local
– Cluster the results of a specific query
– At runtime, with each query
Text Compression
• Why is text compression important?
– Less storage space
– Less time for data transmission
– Less time to search (if the compression method allows direct search without decompression)
Terminology
• Symbol
– The smallest unit for compression (e.g., character, word, or a fixed number of characters)
• Alphabet
– The set of all possible symbols
• Compression ratio
– The size of the compressed file as a fraction of the size of the uncompressed file
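The compression ratio defined above is easy to compute. In this sketch Python's standard `zlib` module is used only as a stand-in compressor, and the repetitive sample text is an assumption chosen so that the data compresses well.

```python
import zlib

def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Size of the compressed data as a fraction of the uncompressed size."""
    return len(compressed) / len(original)

text = ("for each rose, a rose is a rose " * 50).encode("utf-8")
ratio = compression_ratio(text, zlib.compress(text))
print(f"compression ratio: {ratio:.2%}")
```

A smaller ratio means better compression; a ratio above 1 would mean the "compressed" data is larger than the input.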
Types of compression models
• Static models
– Assume some data properties in advance (e.g., relative frequencies of symbols) for all input text
– Allow direct (or random) access
– Poor compression ratios when the input text deviates from the assumption
Types of compression models
• Semi-static models
– Learn data properties in a first pass
– Compress the input data in a second pass
– Allow direct (or random) access
– Good compression ratio
– Must store the learned data properties for decoding
– Must have the whole data at hand
Types of compression models
• Adaptive models
– Start with no information
– Progressively learn the data properties as the compression process goes on
– Need only one pass for compression
– Do not allow random access
• Decompression cannot start in the middle
General approaches to text compression
• Dictionary methods
– (Basic) dictionary method
– Ziv-Lempel’s adaptive method
• Statistical methods
– Arithmetic coding
– Huffman coding
Dictionary methods
• Replace a sequence of symbols with a pointer to a dictionary entry
Input: aaababbbaaabaaaaaaabaabb
Dictionary: aaa, bb
Output (after compression): babbabaa, interleaved with pointers to the dictionary entries (the pointer arrows of the original figure are not reproduced here)
• A dictionary may be suitable for one text but unsuitable for another
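The example above can be reproduced with a small greedy encoder. The token representation (a `("ptr", index)` tuple for a dictionary pointer, a plain character for a literal) and the function names are assumptions made for this sketch.

```python
def dictionary_encode(text, dictionary):
    """Greedy left-to-right substitution of dictionary entries by pointers."""
    output = []
    i = 0
    while i < len(text):
        # Pick the longest dictionary entry that matches at position i, if any.
        match = max((e for e in dictionary if text.startswith(e, i)),
                    key=len, default=None)
        if match is None:
            output.append(text[i])                        # literal symbol
            i += 1
        else:
            output.append(("ptr", dictionary.index(match)))  # pointer to entry
            i += len(match)
    return output

def dictionary_decode(tokens, dictionary):
    """Replace every pointer by the dictionary entry it refers to."""
    return "".join(dictionary[t[1]] if isinstance(t, tuple) else t for t in tokens)

source = "aaababbbaaabaaaaaaabaabb"
entries = ["aaa", "bb"]
tokens = dictionary_encode(source, entries)
literals = "".join(t for t in tokens if isinstance(t, str))
print(literals)
```

The literals left between the pointers spell out the slide's output string, and decoding restores the input. A text with few occurrences of "aaa" or "bb" would gain nothing from this dictionary.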
Adaptive Ziv-Lempel coding
• Instead of dictionary entries, pointers point to the previous occurrences of symbols
Input: aaababbbaaabaaaaaaabaabb

| Phrase no. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Phrase | a | aa | b | ab | bb | aaa | ba | aaaa | aab | aabb |
| Output | 0a | 1a | 0b | 1b | 3b | 2a | 3a | 6a | 2b | 9b |
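The phrase parsing in this example matches the LZ78 variant of Ziv-Lempel coding: each output pair is the number of the longest previously seen phrase plus one new symbol. A minimal sketch follows; the function names are assumptions, and real implementations also bound the dictionary size and pack the pairs into bits.

```python
def lz78_encode(text):
    """Return (phrase number, symbol) pairs; phrase 0 is the empty phrase."""
    phrases = {"": 0}
    output = []
    current = ""
    for ch in text:
        if current + ch in phrases:
            current += ch                       # keep extending a known phrase
        else:
            output.append((phrases[current], ch))
            phrases[current + ch] = len(phrases)  # register the new phrase
            current = ""
    if current:                                 # input ended inside a known phrase
        output.append((phrases[current], ""))
    return output

def lz78_decode(pairs):
    """Rebuild the text by replaying the phrase table."""
    phrases = [""]
    for index, ch in pairs:
        phrases.append(phrases[index] + ch)
    return "".join(phrases[1:])

source = "aaababbbaaabaaaaaaabaabb"
pairs = lz78_encode(source)
print("|".join(f"{i}{c}" for i, c in pairs))
```

Because the phrase table is rebuilt while decoding, the decoder must start from the first pair.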
Adaptive Ziv-Lempel coding
• Good compression ratio (4 bits/character)
– Suitable for general data compression and widely used (e.g., zip, compress)
• Does not allow decoding to start in the middle of a compressed file: direct access is impossible without decompressing from the beginning
Arithmetic coding
• The input text (data) is converted to a real number between 0 and 1, such as 0.328701
• Good compression ratio (2 bits/character)
• Slow
• Cannot start decoding in the middle of a file
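The idea of mapping a whole text to one number can be sketched with exact fractions. This is a simplified illustration under stated assumptions: a semi-static character model built from the message itself, the message length passed to the decoder, and exact `Fraction` arithmetic instead of the bit-level, finite-precision output used by practical coders. All function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def build_intervals(text):
    """Give each symbol a sub-interval of [0, 1) proportional to its frequency."""
    counts = Counter(text)
    intervals, low = {}, Fraction(0)
    for sym in sorted(counts):
        width = Fraction(counts[sym], len(text))
        intervals[sym] = (low, low + width)
        low += width
    return intervals

def arithmetic_encode(text, intervals):
    """Narrow [low, high) once per symbol and return a number inside it."""
    low, high = Fraction(0), Fraction(1)
    for sym in text:
        span = high - low
        s_low, s_high = intervals[sym]
        low, high = low + span * s_low, low + span * s_high
    return (low + high) / 2

def arithmetic_decode(code, intervals, length):
    """Replay the narrowing, choosing the symbol whose interval holds the code."""
    out, low, high = [], Fraction(0), Fraction(1)
    for _ in range(length):
        span = high - low
        value = (code - low) / span
        for sym, (s_low, s_high) in intervals.items():
            if s_low <= value < s_high:
                out.append(sym)
                low, high = low + span * s_low, low + span * s_high
                break
    return "".join(out)

message = "a rose is a rose"
intervals = build_intervals(message)
code = arithmetic_encode(message, intervals)
decoded = arithmetic_decode(code, intervals, len(message))
print(float(code), decoded)
```

Every symbol depends on the interval left by all the previous ones, which is why decoding cannot start in the middle.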
Symbols and alphabet for textual data
• Words are more appropriate symbols for natural language text
• Example: “for each rose, a rose is a rose”
– Alphabet
• {a, each, for, is, rose, ‘ ’, ‘,’}
– Always assume a single space after a word unless there is another separator
• {a, each, for, is, rose, ‘,’}
Huffman coding
• Assign shorter codes (bits) to more frequent symbols and longer codes (bits) to less frequent symbols
• Example: “for each rose, a rose is a rose”
Example

| symb | freq |
| --- | --- |
| each | 1 |
| , | 1 |
| for | 1 |
| is | 1 |
| a | 2 |
| rose | 3 |

[Figure: the Huffman tree is built bottom-up. “each” and “,” merge into a node of weight 2, “for” and “is” merge into another node of weight 2, these two nodes merge into a node of weight 4, and “a” and “rose” merge into a node of weight 5.]
Example
[Figure: the same Huffman tree over the leaves each, “,”, for, is, a, rose, with every branch labeled 0 or 1.]
Example

| symb | freq | code |
| --- | --- | --- |
| each | 1 | 100 |
| , | 1 | 101 |
| for | 1 | 110 |
| is | 1 | 111 |
| a | 2 | 00 |
| rose | 3 | 01 |
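The construction above can be sketched with a priority queue. The symbol list below is the example sentence split into word symbols by hand; the function name and the tie-breaking counter are assumptions, so the exact bit patterns may differ from the table while the code lengths stay the same.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(symbols):
    """Build a Huffman code by repeatedly merging the two lightest nodes."""
    freq = Counter(symbols)
    tiebreak = count()  # avoids comparing tree nodes when weights are equal
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):   # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                         # leaf: a symbol
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes

symbols = ["for", "each", "rose", ",", "a", "rose", "is", "a", "rose"]
codes = huffman_codes(symbols)
encoded = "".join(codes[s] for s in symbols)
print(codes, len(encoded))
```

The frequent symbols “rose” and “a” receive 2-bit codes and the four rare symbols receive 3-bit codes, so the nine symbols take 22 bits in total.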
Canonical tree
[Figure: the Huffman tree for the example over the leaves each, “,”, for, is, a, rose, with every branch labeled 0 or 1.]
– The height of the left subtree of any node is never smaller than that of the right subtree
– All leaves are in increasing order of probabilities (frequencies) from left to right
Advantages of canonical tree
• Smaller data for decoding
– Non-canonical tree needs:
• Mapping table between symbols and codes
– Canonical tree needs:
• (Sorted) list of symbols
• For each level, a pair: the number of symbols and the numerical value of the first code
– E.g., {(0, NA), (2, 2), (4, 0)}
• More efficient encoding/decoding
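A sketch of how the per-level pairs can regenerate the codes. It assumes that the pair at position i describes codes of length i + 1, that the symbol list is sorted by increasing frequency, and that the longest codes go to the first symbols; the function name `canonical_codes` is hypothetical.

```python
def canonical_codes(sorted_symbols, levels):
    """levels[i] = (number of symbols, first code value) for code length i + 1."""
    codes = {}
    remaining = iter(sorted_symbols)
    # Assign codes from the deepest level up, consecutive values within a level.
    for length in range(len(levels), 0, -1):
        n_symbols, first_code = levels[length - 1]
        for offset in range(n_symbols):
            codes[next(remaining)] = format(first_code + offset, f"0{length}b")
    return codes

levels = [(0, None), (2, 2), (4, 0)]   # the example {(0, NA), (2, 2), (4, 0)}
symbols = ["each", ",", "for", "is", "a", "rose"]
codes = canonical_codes(symbols, levels)
print(codes)
```

Under these assumptions the four rare symbols get 000 to 011 and the two frequent ones get 10 and 11, so only the symbol list and three small pairs need to be stored for decoding.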
Byte-oriented Huffman coding
• Use whole bytes instead of binary coding
[Figure: two byte-oriented Huffman trees. Non-optimal tree: 256 symbols, 256 symbols, and 254 empty nodes. Optimal tree: 254 symbols, 256 symbols, and 2 symbols plus 254 empty nodes.]
Comparison of methods
Compression of inverted files
• Inverted file: composed of
– A vector containing all distinct words in the text collection
– For each word, a list of documents in which that word occurs
• Types of code:
– Unary
– Elias-γ
– Elias-δ
– Golomb
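Two of the listed codes can be sketched briefly. Assumptions in this sketch: the unary code of x is x - 1 ones followed by a zero (conventions vary), and the document lists are stored as gaps between consecutive document numbers, a common practice that the slide does not state. The function names are hypothetical.

```python
def unary(x):
    """Unary code of a positive integer: (x - 1) ones, then a zero."""
    return "1" * (x - 1) + "0"

def elias_gamma(x):
    """Elias-gamma: unary code of the bit length, then the bits after the leading 1."""
    binary = format(x, "b")
    return unary(len(binary)) + binary[1:]

def gaps(doc_ids):
    """Turn an increasing list of document numbers into first value plus gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

postings = [3, 7, 8, 15]
encoded = [elias_gamma(g) for g in gaps(postings)]
print(gaps(postings), encoded)
```

Small gaps get short codes, which is why frequent words with dense document lists compress well.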
Conclusions
• Text transformation: meaning instead of strings
– Lexical analysis
– Stopwords
– Stemming
• Text compression
– Searchable
– Random access
– Model-coding
• Inverted files
Thanks. Any questions?