07 - string processing
8/8/2019 07 - String Processing
CS4323 / 0708-2
YFA
Available online at http://www.ittelkom.ac.id/staf/yanuar
Institut Teknologi Telkom
Query Languages: the Common Query Language
The Common Query Language: a formal language for queries to information retrieval systems such as web indexes,
bibliographic catalogs and museum collection information.
Objective: human readable and human writable; intuitive while maintaining the expressiveness of more complex languages.
Traditionally, query languages have fallen into two camps:
(a) Powerful and expressive languages which are not easily readable or writable by non-experts (e.g. SQL and XQuery).
(b) Simple and intuitive languages not powerful enough to express complex concepts (e.g. CCL or Google's query language).
The Common Query Language
The Common Query Language is maintained by the Z39.50 International Maintenance Agency at the Library of Congress.
http://www.loc.gov/z3950/agency/zing/cql/
Tutorial: A Gentle Introduction to CQL.
The Common Query Language: Examples
dinosaur
comp.sources.misc
"the complete dinosaur"
"ext->u.generic"
"and"
Booleans
dinosaur and bird or dinobird
(bird or dinosaur) and (feathers or scales)
(((a and b) or (c not d) not (e or f and g)) and h not i) or j
The Common Query Language: Examples
title = dinosaur
title = dinosaur and bird or dinobird
dc.title = saurischia
bath.title="the complete dinosaur"
srw.resultSet=bar
Index-set mapping (definition of fields)
>dc="http://www.loc.gov/srw/index-sets/dc"
dc.title=dinosaur and dc.author=farlow
The Common Query Language: Examples
Proximity
The prox operator: prox/relation/distance/unit/ordering
Examples:
complete prox dinosaur [adjacent]
(caudal or dorsal) prox vertebra
ribs prox//0/sentence chevrons [same sentence]
ribs prox/>/0/paragraph chevrons [not adjacent]
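The prox operator can be illustrated with a small word-distance check. This is only a sketch under stated assumptions: the unit is always the word, the defaults (relation <=, distance 1, unordered) are assumed rather than taken from the slides, and the function name prox and its parameters are hypothetical. It is not how any particular CQL server evaluates proximity.

```python
import operator

# Comparison relations allowed in the relation slot of prox/relation/distance/unit/ordering.
RELATIONS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
             ">=": operator.ge, ">": operator.gt}


def prox(text, left, right, relation="<=", distance=1, ordered=False):
    """Return True if some occurrence of `left` and some occurrence of `right`
    are separated by a word distance that satisfies `relation distance`."""
    words = text.lower().split()
    left_pos = [i for i, w in enumerate(words) if w == left]
    right_pos = [i for i, w in enumerate(words) if w == right]
    compare = RELATIONS[relation]
    for i in left_pos:
        for j in right_pos:
            gap = j - i
            if ordered and gap <= 0:
                continue  # ordered search: `right` must come after `left`
            if compare(abs(gap), distance):
                return True
    return False
```

With these assumed defaults, prox("the complete dinosaur", "complete", "dinosaur") holds because the two words are adjacent.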
The Common Query Language: Examples
Relations
year > 1998
title all "complete dinosaur"
title any "dinosaur bird reptile"
title exact "the complete dinosaur"
publicationYear < 1980
numberOfWheels 2.4
bioMass >= 100
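The word relations any, all and exact can be sketched as follows. The function name matches and the whitespace tokenization are assumptions made for illustration; a real server would apply its own tokenization and normalization.

```python
def matches(field_value, relation, term):
    """Evaluate one `index relation term` clause against a single field value."""
    field_words = field_value.lower().split()
    term_words = term.lower().split()
    if relation == "any":      # at least one query word occurs in the field
        return any(w in field_words for w in term_words)
    if relation == "all":      # every query word occurs in the field
        return all(w in field_words for w in term_words)
    if relation == "exact":    # the whole field equals the query string
        return field_value.lower() == term.lower()
    raise ValueError(f"unsupported relation: {relation}")
```

For a record titled "The Complete Dinosaur", all three title examples above would match under these assumptions.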
The Common Query Language: Examples
Relation Modifiers
title all/stem "complete dinosaur"
title any/relevant "dinosaur bird reptile"
title exact/fuzzy "the complete dinosaur"
The implementations of relevant and fuzzy are not defined by the query language.
The Common Query Language: Examples
dinosaur* [zero or more characters]
*sauria
man?raptor [exactly one character]
man?raptor*
char\* [literal "*"]
Word Anchoring
title="^the complete dinosaur" [beginning of field]
author="bakker^" [end of field]
author any "^kernighan ^ritchie ^thompson"
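One way to picture the masking characters is to translate a masked term into a regular expression. The sketch below is an assumption-laden illustration: the function name mask_to_regex is hypothetical, wildcards are limited to word characters, and the ^ anchor is not handled because it applies to the whole field rather than to a single word.

```python
import re


def mask_to_regex(term):
    """Translate a masked term (*, ?, backslash escape) into a compiled regex."""
    parts = []
    i = 0
    while i < len(term):
        ch = term[i]
        if ch == "\\" and i + 1 < len(term):
            parts.append(re.escape(term[i + 1]))  # escaped character is literal
            i += 2
            continue
        if ch == "*":
            parts.append(r"\w*")   # zero or more characters
        elif ch == "?":
            parts.append(r"\w")    # exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.IGNORECASE)
```

A word matches a masked term when fullmatch succeeds, for example mask_to_regex("man?raptor").fullmatch("maniraptor").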
The Common Query Language: Examples
dc.author=(kern* or ritchie) and
(bath.title exact "the c programming language" or
dc.title=elements prox///4 dc.title=programming) and
subject any/relevant "style design analysis"
Find records whose author (in the Dublin Core sense) includes either a word beginning kern or the word ritchie, and which have either the exact title (in the sense of the Bath profile) the c programming language or a title containing the words elements and programming not more than four words apart, and whose subject is relevant to one or more of the words style, design or analysis.
Regular Expressions in Java
Package java.util.regex
Classes for matching character sequences against patterns specified by regular expressions.
An instance of the Pattern class represents a regular expression that is specified in string form in a syntax similar to that used by Perl.
Instances of the Matcher class are used to match character sequences against a given pattern.
Input is provided to matchers via the CharSequence interface in order to support matching against characters from a wide variety of input sources.
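The same compile-then-match split can be tried out in Python's re module, used here only to keep all examples in these notes in one language. It is an analogy, not the Java API: re.compile plays roughly the role of Pattern, and the match objects from finditer play roughly the role of Matcher. The pattern and sample text are made up for illustration.

```python
import re

# Compile the expression once (roughly the role of Pattern in the Java package).
pattern = re.compile(r"\w+saur(?:us)?")

text = "the complete dinosaur and the stegosaurus"

# Scan an input string for successive matches (roughly the role of Matcher).
found = [m.group(0) for m in pattern.finditer(text)]
```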
String Searching: Naive Algorithm
Objective: Given a pattern, find any substring of a given text that matches the pattern.
p the pattern to be matched
m length of pattern p (characters)
t the text to be searched
n length of t (characters)
The naive algorithm examines the characters of t in sequence.
for j from 1 to n-m+1
    if character j of t matches the first character of p
        (compare following characters of t and p
        until a complete match or a difference is found)
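The naive algorithm can be sketched in Python as below. The function name naive_search is an assumption, and positions are reported 1-based to agree with the index j used on the slide.

```python
def naive_search(t, p):
    """Return the 1-based positions j at which pattern p occurs in text t."""
    n, m = len(t), len(p)
    hits = []
    for j in range(1, n - m + 2):          # j runs from 1 to n-m+1
        k = 0
        # Compare following characters until a mismatch or a complete match.
        while k < m and t[j - 1 + k] == p[k]:
            k += 1
        if k == m:
            hits.append(j)
    return hits
```

In the worst case every start position compares up to m characters, so the running time is proportional to n times m.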
String Searching: Knuth-Morris-Pratt Algorithm
Concept: The naive algorithm is modified, so that whenever a partial match is found, it may be possible to advance the character index, j, by more than 1.
Example:
p = "university"
t = "the uniform commercial code ..."
The partial match "uni" starts at j=5; after the partial match, the search continues at j=8.
To indicate how far to advance the character pointer, p is preprocessed to create a table, which lists how far to advance against a given length of partial match.
In the example, j is advanced by the length of the partial match, 3.
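A minimal sketch of the idea follows. It uses the standard failure function as the preprocessed table, which is one common way to encode "how far to advance"; the slides do not say which encoding they intend, and the function names are assumptions. For "university" every table entry is 0, so a partial match of length 3 leads to an advance of 3, as in the example.

```python
def failure_table(p):
    """fail[k] = length of the longest proper prefix of p[:k+1] that is also its suffix."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(t, p):
    """Return the 1-based start positions of p in t, scanning t only once."""
    if not p:
        return []
    fail = failure_table(p)
    hits = []
    k = 0                                   # pattern characters matched so far
    for i, ch in enumerate(t):
        while k > 0 and ch != p[k]:
            k = fail[k - 1]                 # fall back without moving backwards in t
        if ch == p[k]:
            k += 1
        if k == len(p):
            hits.append(i - len(p) + 2)     # convert to a 1-based start position
            k = fail[k - 1]
    return hits
```

Because the text index never moves backwards, the search takes time proportional to n plus m rather than n times m.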
Signature Files: Sequential Search without Inverted File
Inexact filter: a quick test which discards many of the non-qualifying items.
Advantages
Much faster than full text scanning -- 1 or 2 orders of magnitude
Modest space overhead -- 10% to 15% of the file
Insertion is straightforward
Disadvantages
Sequential searching is no good for very large files
Signature Files
Signature size. Number of bits in a signature, F.
Word signature. A bit pattern of size F with m bits set to 1 and the others 0.
The word signature is calculated by a hash function.
Block. A sequence of text that contains D distinct words.
Block signature. The logical OR of all the word signatures in a block of text.
Signature Files
Example
Word              Signature
free              001 000 110 010
text              000 010 101 001
block signature   001 010 111 011
F = 12 bits in a signature
m = 4 bits per word
D = 2 words per block
Signature Files
A query term is processed by matching its signature against the block signature.
(a) If the term is in the block, its word signature will always match the block signature.
(b) A word signature may match the block signature, but the word is not in the block. This is a false hit.
The design goal is to minimize the false hit (false drop) probability, Fd.
Frakes, Section 4.2, page 47, discusses how to minimize Fd. The rest of this chapter discusses enhancements to the basic algorithm.
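The matching rule can be sketched as follows. The hash in word_signature is an arbitrary illustrative choice (SHA-256 with a counter), not the hash function of any particular system, and all function names are assumptions. The constants FREE and BLOCK are the signatures from the example; FALSE_HIT is a hypothetical signature, made up to show how a word that is not in the block can still pass the test.

```python
import hashlib

F = 12   # signature size in bits
M = 4    # bits set to 1 in each word signature


def word_signature(word, f=F, m=M):
    """Hash a word to an f-bit pattern with m bits set (illustrative hash choice)."""
    sig, seed = 0, 0
    while bin(sig).count("1") < m:
        digest = hashlib.sha256(f"{seed}:{word}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % f)
        seed += 1
    return sig


def block_signature(words):
    """Superimpose (logical OR) the signatures of all words in a block."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig


def signature_matches(block_sig, word_sig):
    """True if every bit of the word signature is also set in the block signature.
    True may still be a false hit; False means the word is definitely absent."""
    return (block_sig & word_sig) == word_sig


FREE = 0b001000110010        # word signature of "free" from the example
BLOCK = 0b001010111011       # block signature from the example
FALSE_HIT = 0b001010100001   # hypothetical signature of a word not in the block
```

Both signature_matches(BLOCK, FREE) and signature_matches(BLOCK, FALSE_HIT) return True, which is why every candidate block still has to be checked against the actual text.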
String Matching
Simple algorithm: Build an inverted index of all substrings of the file names of the form *f, that is, every suffix of each name.
Example: if the file name is foo.txt, search terms are:
foo.txt
oo.txt
o.txt
.txt
txt
xt
Lexicographic processing allows searching by any query string q.
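A minimal sketch of this idea: store every suffix of every file name in sorted order, then use binary search to find the suffixes that begin with q. The function names and the sample file names in the usage note are assumptions; a production index would avoid materializing every suffix as a separate string.

```python
import bisect


def build_suffix_index(file_names):
    """Sorted list of (suffix, file_name) pairs covering every suffix of every name."""
    entries = []
    for name in file_names:
        for start in range(len(name)):
            entries.append((name[start:], name))
    entries.sort()
    return entries


def find_files(index, q):
    """Return the names containing q: q is a substring iff it is a prefix of a suffix."""
    lo = bisect.bisect_left(index, (q,))   # first entry whose suffix is >= q
    names = set()
    for suffix, name in index[lo:]:
        if not suffix.startswith(q):
            break                          # matching suffixes are contiguous
        names.add(name)
    return sorted(names)
```

For example, with an index over foo.txt, bar.txt and notes.md, find_files(index, "o.t") returns only foo.txt.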
Search for Substring
In some information retrieval applications, any substring can be a search term.
Tries, using suffix trees, provide lexicographical indexes for all the substrings in a document or set of documents.
Tries: Search for Substring
Basic concept
The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text, and continues to the right until it is unique.
The sistrings are stored in (the leaves of) a tree, the suffix tree.
Each sistring can be associated with a location within a document where the sistring occurs. Subtrees below a certain node represent all occurrences of the substring represented by that node.
Suffix trees have a size of the same order of magnitude as the input documents.
Tries: Suffix Tree
Suffix tree for the following words:
begin
beginning
between
bread
break
b
  e
    gin
      null
      ning
    tween
  rea
    d
    k
Tries: Sistrings
A binary example
String: 01 100 100 010 111
Sistrings:
1 01 100 100 010 111
2 11 001 000 101 11
3 10 010 001 011 1
4 00 100 010 111
5 01 000 101 11
6 10 001 011 1
7 00 010 111
8 00 101 11
Tries: Lexical Ordering
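7 00 010 111 [000]
4 00 100 010 111 [00100]
8 00 101 11 [00101]
5 01 000 101 11 [010]
1 01 100 100 010 111 [011]
6 10 001 011 1 [1000]
3 10 010 001 011 1 [1001]
2 11 001 000 101 11 [11]
The unique string of each sistring (shown in blue on the original slide) is given in brackets.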
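The ordering and the unique strings can be recomputed with a short sketch. The function names are assumptions, and, like the binary example, the code considers only the sistrings that start at positions 1 to 8; with all fourteen sistrings the unique prefixes would differ.

```python
def sistrings(bits):
    """Map each 1-based start position to the semi-infinite string beginning there."""
    return {pos + 1: bits[pos:] for pos in range(len(bits))}


def unique_prefix(target, others):
    """Shortest prefix of `target` that is not a prefix of any string in `others`."""
    for length in range(1, len(target) + 1):
        prefix = target[:length]
        if not any(o.startswith(prefix) for o in others):
            return prefix
    return target


text = "01100100010111"                    # the example string, spaces removed
first_eight = {k: s for k, s in sistrings(text).items() if k <= 8}

# Lexical ordering of the sistrings, reported by start position.
order = [k for k, _ in sorted(first_eight.items(), key=lambda item: item[1])]

# Unique prefix of each sistring relative to the other seven.
prefixes = {
    k: unique_prefix(s, [o for j, o in first_eight.items() if j != k])
    for k, s in first_eight.items()
}
```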
Trie: Basic Concept
[Figure: binary trie built from the sistrings; each branch is labeled 0 or 1 and each leaf holds a sistring number (7, 4, 8, 5, 1, 6, 3, 2).]
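A binary trie of this kind can be sketched with nested dictionaries. The unique prefixes in PREFIXES are computed from the eight sistrings of the binary example; the function names and the "pos" leaf key are assumptions. Only those eight sistrings are indexed, so occurrences that start at later positions are not reported, and single-descendant nodes are kept, so this is a plain trie rather than a Patricia tree.

```python
TEXT = "01100100010111"
PREFIXES = {7: "000", 4: "00100", 8: "00101", 5: "010",
            1: "011", 6: "1000", 3: "1001", 2: "11"}


def build_trie(prefixes):
    """Nested-dict binary trie; the key 'pos' marks a leaf holding a sistring number."""
    root = {}
    for pos, prefix in prefixes.items():
        node = root
        for bit in prefix:
            node = node.setdefault(bit, {})
        node["pos"] = pos
    return root


def leaves(node):
    """Collect the sistring numbers stored in the subtree below `node`."""
    found = [node["pos"]] if "pos" in node else []
    for bit in "01":
        if bit in node:
            found.extend(leaves(node[bit]))
    return found


def occurrences(root, text, pattern):
    """1-based start positions of `pattern` among the indexed sistrings."""
    node = root
    for bit in pattern:
        if "pos" in node:
            break                      # reached a leaf before the pattern ran out
        if bit not in node:
            return []
        node = node[bit]
    # Leaves store only unique prefixes, so confirm each candidate against the text.
    return sorted(p for p in leaves(node) if text.startswith(pattern, p - 1))
```

Searching for 001 walks three branches and returns every leaf below that node, here sistrings 4 and 8.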
Patricia Tree
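[Figure: Patricia tree for the same sistrings; branches are labeled 0 or 1, internal nodes carry a bit position, and leaves hold the sistring numbers.]
Single-descendant nodes are eliminated.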
YFA, April 2008
Adapted from cs.cornell.edu