Querying Structured Text

Page 1: Querying Structured Text

CBU, Aug. '07 Region Algebra and sgrep 1

Querying Structured Text

How to query the structure and contents of structured documents?
  – Text-Region Algebra, and search tool sgrep
      » a simple model and tool for content retrieval of arbitrary structured text files, based on concrete document syntax
  – W3C XQuery
      » a rich query and data manipulation language for all types of XML data sources, based on conceptual content of XML documents ("XML Information Set")

Page 2: Querying Structured Text

Region Algebra (and sgrep)

the model: region algebra
  – relatively low-level but appropriate for e.g. XML
  – documents seen as contiguous portions called regions
an implementation: sgrep ("structured grep")
  – command-line based search tool (Unix/Linux)
      » version 0.99: basic features
      » version 1.94a: indexing, SGML/XML/HTML support, other additional features
      » 1.92 also as Win32 binaries; Functional?
  – extracts portions of files/input streams

Page 3: Querying Structured Text

Region algebra: Background

Pat™ (University of Waterloo; OpenText)
  – efficient full-text index (suffix array)
  – match-point delimited regions
      » result sets of non-overlapping regions only; overlapping results represented as their start points, which leads to semantic problems
"generalized concordance lists"
  – Clarke, Cormack and Burkowski 1995
  – overlaps but no nesting in results

Page 4: Querying Structured Text

Nested region algebra

Jaakkola & Kilpeläinen, U. of Helsinki 1995-1999
retrieval of document components as dynamically defined regions, based on their mutual ordering and nesting
no restrictions on the length, overlapping or nesting of regions
no restrictions on document formats
  – regular markup formats (à la XML) most appropriate

Page 5: Querying Structured Text

Design principles and goals

Generality
  – minimize restrictions on targets and use
Simplicity
  – based on a minimum of concepts
Algorithmic efficiency
  – O(|Text|), in most cases
Well-defined semantics

Page 6: Querying Structured Text

Basics of region algebra data model

Set of positions U = {0, …, n-1} comprising a text of length n
  – characters/bytes (sgrep), or words
Regions: contiguous (non-empty) sequences of positions (normally fragments of files)
  – typically occurrences of a string, or document elements
  – region a = {s, s+1, …, e} denoted by (s, e) or (a.s, a.e), by the start and end position of region a

Page 7: Querying Structured Text

Positions in sgrep

Sample file (under Linux):
    $ cat abra.txt
    abracadabra$
Regions normally occurrences of a string:
    $ sgrep -o"(%s,%e)" '"abra"' abra.txt
    (0,3)(7,10)
  – output format template "(%s,%e)" applied to each result region ("show start position and end position in parentheses")
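A minimal Python sketch of the position model, not sgrep's own code: a region is taken to be an inclusive (start, end) pair of 0-based positions, and the helper names `occurrences` and `region_text` are made up for this illustration.

```python
def occurrences(text, phrase):
    """Return all occurrences of phrase in text as inclusive (start, end) regions."""
    regions = []
    start = text.find(phrase)
    while start != -1:
        regions.append((start, start + len(phrase) - 1))
        # Advance by one position so that overlapping occurrences are found too.
        start = text.find(phrase, start + 1)
    return regions


def region_text(text, region):
    """Return the text covered by an inclusive (start, end) region."""
    start, end = region
    return text[start:end + 1]


text = "abracadabra"
# Mimics the output template "(%s,%e)" applied to each result region.
print("".join("(%d,%d)" % r for r in occurrences(text, "abra")))  # prints (0,3)(7,10)
```

The end position is inclusive, so a region of length k starting at s ends at s + k - 1.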

Page 8: Querying Structured Text

Positions in sgrep (2)

The first position:
    $ sgrep 'start' abra.txt
    a
The last position:
    $ sgrep 'end' abra.txt
    $
The entire document (more on operator '..' later):
    $ sgrep 'start .. end' abra.txt
    abracadabra$
  also:
    sgrep 'file("*")' abra.txt

Page 9: Querying Structured Text

Relationships of regions a and b

Language based on relationships between regions:
1) a precedes b, b follows a:
  – a ends before b starts, a.e < b.s
2) a is included in b (b contains a): a ⊆ b (b ⊇ a):
  – b.s ≤ a.s, and a.e ≤ b.e
  – proper containment: a ⊆ b, and b ⊈ a
3) if neither of the above, a and b overlap
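The three relationships can be sketched as predicates on inclusive (start, end) pairs. This is an illustration only; the function names and the returned labels are assumptions, and the separate "equals" label for identical regions is added here for readability.

```python
def precedes(a, b):
    """a ends before b starts: a.e < b.s."""
    return a[1] < b[0]


def included_in(a, b):
    """b.s <= a.s and a.e <= b.e (not necessarily proper)."""
    return b[0] <= a[0] and a[1] <= b[1]


def relationship(a, b):
    """Classify how region a relates to region b."""
    if precedes(a, b):
        return "precedes"
    if precedes(b, a):
        return "follows"
    if a == b:
        return "equals"
    if included_in(a, b):
        return "included in"
    if included_in(b, a):
        return "contains"
    # Neither ordered nor nested: the regions partially overlap.
    return "overlaps"


print(relationship((0, 3), (7, 10)))  # prints precedes
print(relationship((0, 5), (3, 8)))   # prints overlaps
```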

Page 10: Querying Structured Text

Region algebra

Region algebra is a set-valued language
  – the value of any expression is a set of regions
  – cf. relational algebra: set of rows/tuples
Operations map region sets to region sets, which gives a compositional language: operands of expressions can be any other expressions
Hence arbitrarily complex queries can be formed

Page 11: Querying Structured Text

Building new regions

Nested regions (e.g. document elements)
  – "followed-by" operators
Non-nested regions
  – "quote" operators
By removing overlap with other regions
  – "extracting" operator

Page 12: Querying Structured Text

"Followed-by"

Most central and characteristic of region algebra and sgrep operators
allows dynamic creation of regions from their bounding regions
generalises the way parentheses are matched, starting from inside, always matching the closest unmatched ones

Page 13: Querying Structured Text

Followed-by example

    $ cat expr.txt
    (1(2(3))(4))
Extract parenthesised sub-expressions:
    $ sgrep -o"%r//" '"(" .. ")"' expr.txt
    (1(2(3))(4))//(2(3))//(3)//(4)//
  – without the output format switch -o, each position covered by the result regions is displayed only once

Page 14: Querying Structured Text

Value of A .. B formally

A and B sets of regions
A .. B = {(a.s, b.e)} ∪ ((A - {a}) .. (B - {b})),
where a ∈ A, b ∈ B such that a.e < b.s and the a.e .. b.s distance is minimal
  – (and also distance btw a.s and b.e is minimal, if there are otherwise equidistant pairs)
  – empty set, if there are no such a ∈ A and b ∈ B
each a ∈ A and each b ∈ B produces at most one result region (a.s, b.e)
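The recursive definition can be turned into a direct, deliberately naive Python sketch. It is meant only to illustrate the semantics stated on this slide, not how sgrep evaluates the operator; the function name `followed_by` and the (start, end) tuple representation are assumptions.

```python
def followed_by(A, B):
    """A .. B: repeatedly pair the closest a in A and b in B with a.e < b.s."""
    A, B = set(A), set(B)
    result = []
    while True:
        # Rank candidate pairs by the a.e .. b.s distance, then by the
        # a.s .. b.e distance to break ties between equidistant pairs.
        candidates = [
            (b[0] - a[1], b[1] - a[0], a, b)
            for a in A
            for b in B
            if a[1] < b[0]
        ]
        if not candidates:
            break
        _, _, a, b = min(candidates)
        result.append((a[0], b[1]))
        # Each a and each b produces at most one result region.
        A.remove(a)
        B.remove(b)
    return sorted(result)


text = "(1(2(3))(4))"
opens = [(i, i) for i, c in enumerate(text) if c == "("]
closes = [(i, i) for i, c in enumerate(text) if c == ")"]
for s, e in followed_by(opens, closes):
    print(text[s:e + 1], end="//")
print()  # prints (1(2(3))(4))//(2(3))//(3)//(4)//
```

Recomputing all candidate pairs in every round is far from linear; the sketch trades efficiency for a one-to-one match with the definition.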

Page 15: Querying Structured Text

Regions of document elements

Regions delimited by start and end tags:
    $ sgrep '"<TITLE>" .. "</TITLE>"' sgrepman.html
    <TITLE> sgrep - Manual page </TITLE>
Delimiting regions can be left out:
  – using also the built-in markup scanner to recognise tags:
    $ sgrep 'stag("TITLE") _. etag("TITLE")' sgrepman.html
     sgrep - Manual page </TITLE>
    $ sgrep 'stag("TITLE") ._ etag("TITLE")' sgrepman.html
    <TITLE> sgrep - Manual page
    $ sgrep 'stag("TITLE") __ etag("TITLE")' sgrepman.html
     sgrep - Manual page
(possible empty regions are excluded from A __ B)

Page 16: Querying Structured Text

"Quote" operators

Disjoint regions are often delimited by identical start and end markers
  – e.g. string constants in programming languages
    $ cat hello.txt
    "Hello," said Bob, "nice day!"
    $ sgrep '"\"" quote "\""' hello.txt
    "Hello,""nice day!"
  – starting or ending regions, or both, can be excluded (similarly to _., ._ and __) using _quote, quote_ and _quote_
  \" == \#34 == \#x22
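A rough Python sketch of one plausible reading of `quote`: scanning left to right, the leftmost start marker is paired with the first end marker after it, and markers that fall inside a result are ignored. The function name and this exact pairing rule are assumptions made for illustration; the sgrep manual is the authority on the precise semantics.

```python
def quote(A, B):
    """Pair each unconsumed start region with the first end region after it."""
    ends = sorted(B)
    result = []
    last = -1  # last position covered by an emitted result region
    j = 0
    for a in sorted(A):
        if a[0] <= last:
            continue  # this start marker lies inside the previous result
        # Skip end markers that do not start after the start marker ends.
        while j < len(ends) and ends[j][0] <= a[1]:
            j += 1
        if j == len(ends):
            break
        result.append((a[0], ends[j][1]))
        last = ends[j][1]
    return result


text = '"Hello," said Bob, "nice day!"'
marks = [(i, i) for i, c in enumerate(text) if c == '"']
print("".join(text[s:e + 1] for s, e in quote(marks, marks)))
# prints "Hello,""nice day!"
```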

Page 17: Querying Structured Text

Extracting overlapping regions

A extracting B = regions that result by removing from regions of A any overlap with regions in B
Contents of sections without subsections:
    sgrep 'stag("sec") .. etag("sec") extracting (stag("subsec") .. etag("subsec"))' ex.xml
Content without markup tags:
    sgrep 'start .. end extracting ("<" quote ">")' *.html
Note: No operator precedence; parenthesised sub-expressions are evaluated first, otherwise left-to-right
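As an illustration of `extracting`, the following Python sketch subtracts every region of B from every region of A and keeps the remaining non-empty pieces. It is a naive quadratic version under the assumption that regions are inclusive (start, end) pairs; the function name is hypothetical.

```python
def extracting(A, B):
    """Remove from each region of A any overlap with regions of B."""
    out = set()
    for a in A:
        pieces = [a]
        for bs, be in B:
            remaining = []
            for ps, pe in pieces:
                if be < ps or bs > pe:
                    remaining.append((ps, pe))  # no overlap with this b
                    continue
                if ps < bs:
                    remaining.append((ps, bs - 1))  # part left of b
                if be < pe:
                    remaining.append((be + 1, pe))  # part right of b
            pieces = remaining
        out.update(pieces)
    return sorted(out)


text = "<b>hi</b> there"
whole = [(0, len(text) - 1)]
tags = [(0, 2), (5, 8)]  # regions of "<b>" and "</b>"
print([text[s:e + 1] for s, e in extracting(whole, tags)])  # prints ['hi', ' there']
```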

Page 18: Querying Structured Text

Containment conditions

Allow selecting regions based on their context or content
Operators for selecting regions that
  – appear / do not appear in a given context
  – contain / do not contain a region of another set
Similar operators in other variations of region algebra

Page 19: Querying Structured Text

Containment operators formally

A in B:             {a ∈ A | ∃ b ∈ B: a ⊂ b}
A not in B:         {a ∈ A | ¬∃ b ∈ B: a ⊂ b}
A containing B:     {a ∈ A | ∃ b ∈ B: a ⊃ b}
A not containing B: {a ∈ A | ¬∃ b ∈ B: a ⊃ b}
NB: proper containment
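The four containment operators translate almost literally into Python list comprehensions over (start, end) pairs. This sketch checks all pairs and is therefore quadratic, unlike a merge-like evaluation; the function names are chosen for this example only.

```python
def properly_inside(a, b):
    """a ⊂ b: a lies within b and differs from it."""
    return b[0] <= a[0] and a[1] <= b[1] and a != b


def op_in(A, B):
    return [a for a in A if any(properly_inside(a, b) for b in B)]


def op_not_in(A, B):
    return [a for a in A if not any(properly_inside(a, b) for b in B)]


def containing(A, B):
    return [a for a in A if any(properly_inside(b, a) for b in B)]


def not_containing(A, B):
    return [a for a in A if not any(properly_inside(b, a) for b in B)]


# Regions of the parenthesised sub-expressions of "(1(2(3))(4))".
S = [(0, 11), (2, 7), (4, 6), (8, 10)]
print(op_in(S, S))           # prints [(2, 7), (4, 6), (8, 10)]
print(not_containing(S, S))  # prints [(4, 6), (8, 10)]
```

Because containment is proper, a region never counts as being in, or containing, itself.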

Page 20: Querying Structured Text

Set operations

Union, intersection and difference:
    A or B, A equal B, A not equal B
Rather seldom needed (except for 'or'); containment conditions otherwise sufficient for expressing Boolean retrieval (see next)

Page 21: Querying Structured Text

Expressing Boolean queries

HTML files (%f) containing "cat" AND "dog":
    sgrep -o"%f\n" 'start .. end containing "cat" containing "dog"' *.html
HTML files containing "cat" OR "dog":
    sgrep -o"%f\n" 'start .. end containing ("cat" or "dog")' *.html
HTML files containing "cat" but NO "dog":
    sgrep -o"%f\n" 'start .. end containing "cat" not containing "dog"' *.html

Page 22: Querying Structured Text

Additional operations on region sets

concat(A): minimal set of regions covering exactly the regions of A
  – default result formatting of sgrep
inner(A) = A not containing A
  – the innermost regions in A
outer(A) = A not in A
  – the outermost regions in A
  – e.g. to get the document root element: outer(elements)
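A small Python sketch of these three operations on inclusive (start, end) pairs. It assumes that "covering exactly the regions of A" means overlapping and directly adjacent regions are merged; the function names mirror the operators but the code is an illustration only.

```python
def properly_inside(a, b):
    """a lies within b and differs from it."""
    return b[0] <= a[0] and a[1] <= b[1] and a != b


def concat(A):
    """Merge overlapping or adjacent regions into a minimal covering set."""
    merged = []
    for s, e in sorted(A):
        if merged and s <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def inner(A):
    """A not containing A: the innermost regions."""
    return [a for a in A if not any(properly_inside(b, a) for b in A)]


def outer(A):
    """A not in A: the outermost regions."""
    return [a for a in A if not any(properly_inside(a, b) for b in A)]


S = [(0, 11), (2, 7), (4, 6), (8, 10)]
print(inner(S))                          # prints [(4, 6), (8, 10)]
print(outer(S))                          # prints [(0, 11)]
print(concat([(0, 3), (4, 6), (9, 10)]))  # prints [(0, 6), (9, 10)]
```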

Page 23: Querying Structured Text

Structural retrieval

Arbitrarily complex queries, e.g., "extract titles of pages which mention cats, but not in the title of the page":
    sgrep '"<TITLE>" .. "</TITLE>" in (start .. end containing ("cat" not in ("<TITLE>" .. "</TITLE>")))' *.html
But notice: The model supports (and sgrep implements) only extracting the regions that satisfy the query, in order
  – restricted modification of regions with option -o

Page 24: Querying Structured Text

Sgrep: some useful options

-a: Display result regions surrounded by the rest of the file; useful with -o
-o <style>: Set output format string, possibly containing the following placeholders:
  – %f: name of the file containing the start of region,
  – %s, %e: start and end position,
  – %r: the content (text) of the region,
  – %n: the ordinal number of the region, (+ few others);
Output once for each result region, with current values substituted for placeholders

Page 25: Querying Structured Text

Sgrep: more useful options

-c       only count the matching regions
-h       a short help (list of options)
-f file  read commands from file
-i       ignore case distinctions in phrases
-S       stream mode (regions extend across files)
-T/-t    show statistics about things done/time spent
More in the man page and README

Page 26: Querying Structured Text

Macros

Shorthand notations for complex queries can be given as m4 definitions
In a file, e.g., ELEMS.m4:
    define(NAMED_STAG,("<$1>" or (("<$1 " or "<$1\t" or "<$1\n") quote ">")))
    define(NAMED_ELEMS,(NAMED_STAG($1) .. "</$1>"))

Page 27: Querying Structured Text

Macros in Queries

Then, say, books by Gray can be queried:
    $ sgrep -f ELEMS.m4 -e 'NAMED_ELEMS(book) containing (NAMED_ELEMS(author) containing "Gray")' bib.xml
A built-in scanner (see later) eliminates the use of macros for XML/HTML tokens (like above), but macros are useful for many non-XML-related queries

Page 28: Querying Structured Text

Applications of sgrep

Simple document assembly
  – Jaakkola and Kilpeläinen: Using sgrep for querying structured text files, SGML Finland '96
Analysing element structures
As a Web site search engine
  – E.g. Chapter 7 in Leventhal, Lewis & Fuchs: Designing XML Internet Applications
  – Using sgrep indexer to speed up querying static files

Page 29: Querying Structured Text

An assembly system prototype

Page 30: Querying Structured Text

Analysing element structures

Structure of element instances in a collection? The DTD does not tell!
Element statistics:
    lu 1751 (min: 67, avg: 6849.37, max: 115620)
        41 -> 98 huom
        1751 -> 1751 nu
        1748 -> 1748 ot
1751 lu elements, with shown minimum, average and maximum lengths; 41 contain huom elements directly, and 98 huom elements are children of lu elements.

Page 31: Querying Structured Text

Structure analysis: implementation

Generated by a Python script
each datum (element type name, count, length) computed by generating and executing an sgrep query

Page 32: Querying Structured Text

Implementation of sgrep

Steps in executing an sgrep query?
1. Query preprocessing
      » macro expansion; external & optional
2. Query parsing
3. Query optimisation
4. String retrieval on the text
5. Operator evaluation
6. Data delivery
      » Outputting of result regions

Page 33: Querying Structured Text

1. Query preprocessing

Consider the below query Q:
    sgrep -f macros -e'outer(S in S)' a.xml b.xml
With suitable macro definitions Q becomes
    outer(("<sec>" .. "</sec>") in ("<sec>" .. "</sec>"))
[Figure: Q and the macros are fed to the m4 preprocessor, which produces the expanded query]

Page 34: Querying Structured Text

2. Query parsing

  – Query into an operator tree:
      » operators as internal nodes, string phrases as leaves
[Figure: operator tree with root outer; its child in has two operands, each a .. node over the leaves "<sec>" and "</sec>"]

Page 35: Querying Structured Text

3. Query optimisation

  – common subexpression elimination
  – operator tree to an operator graph (DAG)
      » e.g., an operator tree of 735 nodes (for BibTeX records) reduces to a DAG of 103 nodes
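Common subexpression elimination can be sketched in Python by giving every structurally identical subtree the same node id (hash-consing). The nested-tuple encoding of queries and the function names are assumptions for this illustration, not sgrep's internal data structures.

```python
def tree_size(node):
    """Count the nodes of an operator tree given as nested tuples."""
    if isinstance(node, str):
        return 1  # a string phrase is a leaf
    return 1 + sum(tree_size(child) for child in node[1:])


def to_dag(node, table):
    """Return the DAG node id for node, reusing ids of identical subtrees."""
    if isinstance(node, str):
        key = ("phrase", node)
    else:
        # Children are converted first, so equal subtrees yield equal keys.
        key = (node[0],) + tuple(to_dag(child, table) for child in node[1:])
    if key not in table:
        table[key] = len(table)
    return table[key]


# outer(("<sec>" .. "</sec>") in ("<sec>" .. "</sec>"))
sec = ("..", "<sec>", "</sec>")
query = ("outer", ("in", sec, sec))

table = {}
to_dag(query, table)
print(tree_size(query), len(table))  # prints 8 5
```

The tree has 8 nodes, while the DAG needs only 5 because both operands of `in` map to the same `..` node.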

[Figure: operator graph with root outer and child in; both operands of in point to one shared .. node over the leaves "<sec>" and "</sec>"]

Page 36: Querying Structured Text

4. String retrieval (1)

  – A deterministic Aho-Corasick automaton M built of string patterns in the query
[Figure: automaton M recognising the patterns "<sec>" and "</sec>"]
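For readers unfamiliar with Aho-Corasick matching, here is a compact textbook-style Python sketch that finds all occurrences of several patterns in one pass and reports them as inclusive (start, end) regions. It uses failure links rather than a fully determinised transition table, and it is not derived from the sgrep sources; all names are illustrative.

```python
from collections import deque


def build_automaton(patterns):
    """Build the goto, failure and output tables of an Aho-Corasick automaton."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first traversal: failure links of shallower states are ready first.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def scan(text, patterns):
    """Return {pattern: region list} for all patterns, scanning text once."""
    goto, fail, out = build_automaton(patterns)
    hits = {p: [] for p in patterns}
    s = 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits[p].append((i - len(p) + 1, i))
    return hits


print(scan("<sec>a<sec>b</sec></sec>", ["<sec>", "</sec>"]))
# prints {'<sec>': [(0, 4), (6, 10)], '</sec>': [(12, 17), (18, 23)]}
```

Each returned list is already in increasing position order, which suits the region lists attached to the leaves of an operator graph.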

Page 37: Querying Structured Text

String retrieval (2)

  – Input files (or streams) scanned using M
      » each pattern simultaneously
      » region list of pattern occurrences attached to the leaves of the operator graph
[Figure: operator graph with region lists attached to its leaves]
    "<sec>":  [(5,9), (26,30), …]
    "</sec>": [(20,25), (100,105), …]
Alternatively, region lists can be obtained by a look-up from a pre-computed index (sgrep 2), without scanning target files

Page 38: Querying Structured Text

5. Operator evaluation

Bottom-up traversal of the operator graph
  – value of node: a region list
      » of (start, end) index pairs
      » internal representation of region sets
      » maintained in increasing (start, end) order
  – each node evaluated once
  – (sub)queries with large results may require lots of main memory!

Page 39: Querying Structured Text

Operator graph evaluation

[Figure: operator graph evaluated bottom-up, with the region list computed at each node]
    "<sec>":  [(5,9), (26,30), …]
    "</sec>": [(20,25), (100,105), …]
    ..:       [(5,25), (26,105), …]
    in:       [(1000,1500), (1010,1200), …]
    outer:    [(1000,1500), (1800,1923), …]

Page 40: Querying Structured Text

6. Data delivery

Finally, the regions in the result region list are output in document order
  – text retrieved by indexing target files with start and end positions of result regions
  – result regions possibly modified according to the output style specification (option -o)

Page 41: Querying Structured Text

Evaluating region algebra operators

Synchronised (merge-like, linear) traversal of operand lists, except that ...
  – A .. B requires sorting of A (by region end positions) if A contains nested regions
  – A extracting B may require considering the same regions of A multiple times

Page 42: Querying Structured Text

Evaluation complexity

Worst-case time O(t n), where n is the maximum size and t the maximum "thickness" (number of regions overlapping at any position) of any region set
  – Jaakkola and Kilpeläinen: Nested Text-Region Algebra, January 1999.
In practice linear time; comparable to Unix greps
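The "thickness" t of a region set can be computed with a simple sweep over region boundaries. The Python sketch below is only meant to make the definition concrete; the function name is an assumption, and regions are inclusive (start, end) pairs.

```python
def thickness(regions):
    """Maximum number of regions that cover any single position."""
    events = []
    for s, e in regions:
        events.append((s, 1))       # region starts covering position s
        events.append((e + 1, -1))  # region no longer covers position e + 1
    depth = best = 0
    # At equal positions the -1 events sort first, so a region ending at
    # position p - 1 is closed before one starting at p is opened.
    for _, delta in sorted(events):
        depth += delta
        best = max(best, depth)
    return best


# Regions of the sub-expressions of "(1(2(3))(4))": three of them cover "(3)".
print(thickness([(0, 11), (2, 7), (4, 6), (8, 10)]))  # prints 3
```

For flat, non-nested region sets t is 1, which is consistent with the linear behaviour noted on this slide.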

Page 43: Querying Structured Text

Extensions in sgrep 2

(pre-release v. 1.94a)
documented (only!) in the README file
XML/HTML/SGML support
  – scanner to recognise markup tokens
      » SGML/HTML-mode (default): markup names converted to UPPER CASE
      » XML-mode: tag names case-sensitive
  – simple-minded parser to recognise elements
  – 16-bit wide characters in XML documents
  – NB: currently no expanding of entity references; no validation or well-formedness checking; unaware of namespaces

Page 44: Querying Structured Text

Extended sgrep features (2)

Direct containment:
  – A childrening B, A parenting B
Restricting the number of result regions
  – first(n, E), last(n, E)
Truncating result regions
  – first_bytes(n, E), last_bytes(n, E)
Nearness operators
  – A near(n) B, A near_before(n) B
Indexing of both structure and content

Page 45: Querying Structured Text

Querying markup and structure

Markup tokens can be queried:
    turni> sgrep -o"%r\n" -g xml 'stag("email") .. etag("email")' REC-xml-19980210.xml
    <email href="mailto:[email protected]">tbray@textuality.com</email>
    <email href="mailto:[email protected]">jeanpa@microsoft.com</email>
    <email href="mailto:[email protected]">[email protected]</email>

Page 46: Querying Structured Text

Querying markup (2)

Arbitrary suffix of strings in markup: "*"
E.g., count arbitrary empty elements:
    turni> sgrep -c 'stag("*") containing "/>"' REC-xml-19980210.xml
    68

Page 47: Querying Structured Text

Querying markup (3)

Attributes can be searched by name and by value
  – E.g. start tags having attribute key with value 'Aho/Ullman':
    turni> sgrep -o"%r\n" -g xml \
    > 'stag("*") containing (attribute("key") containing attvalue("Aho/Ullman"))' \
    > REC-xml-19980210.xml
    <bibl id='Aho' key='Aho/Ullman'>

Navigating in structure hierarchy

Top-level components in the Spec?
turni> sgrep -c 'outer(elements childrening elements)' \
REC-xml-19980210.xml
3

Elements with head or header as a child?
turni> sgrep -cg xml 'elements parenting (stag("head*") .. etag("head*"))' REC-xml-19980210.xml
129

Parenting and childrening formally

A parenting B = $\{\, a \in A \mid \exists b \in B : a \supset b \text{ and } \neg\exists a' \in A : a \supset a' \supset b \,\}$

A childrening B = $\{\, a \in A \mid \exists b \in B : a \subset b \text{ and } \neg\exists a' \in A : a \subset a' \subset b \,\}$

(only for A without overlapping regions)
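The two definitions can be tried out on a toy example. The following is a minimal Python sketch, not sgrep's implementation: regions are modelled as (start, end) byte offsets with inclusive ends, containment is taken to be strict, and the function names, the `outer` helper and the offsets of the toy document are all assumptions made for illustration.

```python
# Regions are (start, end) byte offsets, end inclusive (an assumption of this sketch).

def contains(x, y):
    """True if region x strictly contains region y."""
    return x[0] <= y[0] and y[1] <= x[1] and x != y

def parenting(A, B):
    """Regions a of A that contain some b of B with no other region of A in between."""
    return [a for a in A
            if any(contains(a, b)
                   and not any(contains(a, a2) and contains(a2, b) for a2 in A)
                   for b in B)]

def childrening(A, B):
    """Regions a of A contained in some b of B with no other region of A in between."""
    return [a for a in A
            if any(contains(b, a)
                   and not any(contains(b, a2) and contains(a2, a) for a2 in A)
                   for b in B)]

def outer(A):
    """Regions of A that are not contained in any other region of A."""
    return [a for a in A if not any(contains(a2, a) for a2 in A)]

# Hypothetical offsets for: <doc><header><p/></header><body/><back/></doc>
doc, header, p, body, back = (0, 99), (5, 40), (13, 30), (41, 60), (61, 90)
elements = [doc, header, p, body, back]

top_level = outer(childrening(elements, elements))
print(top_level)                  # the three children of the document element
print(parenting(elements, [p]))   # only the direct parent of p
```

On this toy input, `outer(elements childrening elements)` picks out the children of the document element, which is the same query shape as the "top-level components" example.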

Restricting result regions

Display first 30 bytes of first 3 paragraphs:

turni> sgrep -o"%r ...\n" -g xml \
'first_bytes(30, first(3, stag("p") .. etag("p")))' REC-xml-19980210.xml
<p>The Extensible Markup Langua...
<p>This document has been revie...
<p>This document specifies a s...
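How the two restrictions compose can be sketched in a few lines of Python. This is only an illustration under stated assumptions: regions are (start, end) offsets with inclusive ends, regions shorter than n bytes are assumed to be dropped by `first_bytes`, and the sample text and the regular expression standing in for `stag("p") .. etag("p")` are made up.

```python
import re

def first(n, regions):
    """The first n regions in document order."""
    return sorted(regions)[:n]

def first_bytes(n, regions):
    """The first n bytes of every region that is at least n bytes long."""
    return [(s, s + n - 1) for (s, e) in regions if e - s + 1 >= n]

# Hypothetical sample text; the regex is a stand-in for stag("p") .. etag("p").
text = "<p>alpha beta gamma</p><p>short</p><p>delta epsilon zeta</p><p>eta theta iota</p>"
paragraphs = [(m.start(), m.end() - 1) for m in re.finditer(r"<p>.*?</p>", text)]

snippets = [text[s:e + 1] for (s, e) in first_bytes(10, first(3, paragraphs))]
for snippet in snippets:
    print(snippet + " ...")
```

The inner restriction selects whole regions and the outer one trims each of them, so the fourth paragraph never reaches `first_bytes`.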

Restricting results (2)

Get end tags of the children of the document element:
turni> sgrep -o"%r\n" -g xml \
'etag("*") containing last_bytes(1,
outer(elements in elements))' \
REC-xml-19980210.xml
</header>
</body>
</back>

Nearness Conditions

A near(n) B: minimal regions that contain both some a ∈ A and some b ∈ B separated by at most n bytes

$ sgrep -o"[%r]\n" '"Adam" near(80) "Eve"' ot.xml
[Adam called his wife's name Eve]
[Eve; because she was the mother of all living.</v><v>Unto Adam]
[Adam knew Eve]

Ordered Nearness

Can request "Adam" and "Eve" to occur close to, and before "LORD"

$ sgrep -o"[%r]\n" '"Adam" near(80) "Eve" near_before(80) "LORD"' ot.xml
[Eve; because she was the mother of all living.</v><v>Unto Adam also and to his wife did the LORD]
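The nearness operators can be mimicked with a small Python sketch. It is a reading of the definition given on the slides, not sgrep's code: regions are (start, end) offsets with inclusive ends, the distance is assumed to be the number of bytes between the end of one region and the start of the other, and the helper names and the sample verse text are assumptions.

```python
def occurrences(text, phrase):
    """All (start, end) regions where phrase occurs, end inclusive."""
    out, i = [], text.find(phrase)
    while i != -1:
        out.append((i, i + len(phrase) - 1))
        i = text.find(phrase, i + 1)
    return out

def minimal(regions):
    """Keep only regions that do not contain another candidate region."""
    rs = sorted(set(regions))
    return [r for r in rs
            if not any(o != r and r[0] <= o[0] and o[1] <= r[1] for o in rs)]

def near_before(A, B, n):
    """Minimal regions spanning some a followed by some b at most n bytes later."""
    return minimal([(a[0], b[1]) for a in A for b in B
                    if a[1] < b[0] and b[0] - a[1] - 1 <= n])

def near(A, B, n):
    """Like near_before, but a and b may come in either order."""
    return minimal(near_before(A, B, n) + near_before(B, A, n))

# Hypothetical sample text modelled on the output shown above.
text = ("And Adam called his wife's name Eve; because she was the mother "
        "of all living.</v><v>Unto Adam also and to his wife did the LORD "
        "God make coats of skins")

adam_eve = near(occurrences(text, "Adam"), occurrences(text, "Eve"), 80)
with_lord = near_before(adam_eve, occurrences(text, "LORD"), 80)
for s, e in with_lord:
    print("[" + text[s:e + 1] + "]")
```

Only the second Adam/Eve region survives the ordered step in this sample: the first one ends at "Eve", which is more than 80 bytes away from "LORD".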

Indexing in sgrep 2.0

Both the structure and content (words) indexed (max file size 2 GB)

Creates a separate index (postings file)
– terms with region lists of their occurrences
– index a compressed binary file, size 30-60% of the original files

Makes access to static document collections much more efficient
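The idea of a postings file, terms mapped to region lists of their occurrences, can be sketched in Python as follows. This is a toy in-memory version under simplifying assumptions: only content words are indexed (no markup), nothing is compressed or written to disk, and the tokenizer, the function name and the sample sentence are invented for the example.

```python
import re
from collections import defaultdict

def build_index(text):
    """Map each word to the list of (start, end) regions where it occurs."""
    postings = defaultdict(list)
    for m in re.finditer(r"\w+", text):
        postings[m.group()].append((m.start(), m.end() - 1))
    return dict(postings)

# Hypothetical sample sentence.
text = "Tim Bray edited the spec; Bray and Paoli wrote it"
index = build_index(text)

print(index["Bray"])        # region list for one term
print(len(index["Bray"]))   # a count query needs no scan of the text
```

Once the index exists, a query touches only the region lists of its terms, which is why lookups against a static collection can be so much faster than rescanning the files.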

Indexing example

Index a 64-fold copy of the XML Rec (S64, 10 MB):
> sgrep -I -g xml -c S64.index S64
> ls -l S64.index
… 3621954 Feb 8 21:01 S64.index

How often "Bray" mentioned in file S64?
> time sgrep -c 'word("Bray")' S64
320
3.80user 0.07system 0:03.87elapsed

The same using the index:
> time sgrep -c -x S64.index 'word("Bray")'
320
0.02user 0.01system 0:00.03elapsed

over 100-fold speed-up!

A Limitation of Indexing

> sgrep -c '"Bray"' S64
384
(64 times in XML comments)

Not a full-text index:
> sgrep -c -x S64.index '"Bray"'
0 !!

Complete words (+ their prefixes) available:
> sgrep -c -x S64.index 'word("Bray")'
320
> sgrep -c -x S64.index 'word("Br*")'
576
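Why a word index answers whole-word and prefix queries but not arbitrary string queries can be seen in a small Python sketch. It is an assumption-laden toy, not sgrep's index format: terms are kept in a sorted list, a trailing "*" is treated as a prefix wildcard, and the sample text uses names embedded in longer words (rather than XML comments) to make the string count and the word count differ.

```python
import re
from bisect import bisect_left
from collections import Counter

# Hypothetical sample text.
text = "Bray, Brayton and McBray met Bryan"
term_counts = Counter(re.findall(r"\w+", text))
terms = sorted(term_counts)

def word_count(pattern):
    """Count occurrences of a whole word, or of all words with a given prefix."""
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        i = bisect_left(terms, prefix)   # first term >= prefix
        total = 0
        while i < len(terms) and terms[i].startswith(prefix):
            total += term_counts[terms[i]]
            i += 1
        return total
    return term_counts.get(pattern, 0)

print(text.count("Bray"))   # string search over the text itself
print(word_count("Bray"))   # whole-word lookup in the term list
print(word_count("Br*"))    # prefix lookup in the term list
```

A sorted term list makes the prefix scan a contiguous range, whereas finding "Bray" inside "McBray" would require the original text, which a lookup in the term list does not consult.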