modeling and querying web datakolari1/mining/papers/vakali...data, such as type assignment or query...

21
Modeling and Querying Web Data: A Constraint-Based Logic Approach Evimaria Terzi* Mohand-Sad Hacid Athena Vakali* *Department of Computer Sciences Laboratoire dIngØnierie des SystLmes dInformation Purdue University 20, avenue Albert Einstein West Lafayette, IN 47907 USA 69621 Villeurbanne France {edt, vakali}@cs.purdue.edu [email protected] 1 Introduction New applications and the expanding rate of data being stored and circulated over distributed systems and the Internet has resulted in the development of new database technologies. The need for flexibility and accuracy in data representation and manipulation is vital in Web databases [4,34]. The most important characteristic of these applications is the diversity of the data being stored and the lack of fixed and rigid structure (i.e. schema). The core problem in semistructured data is that the structure is not fully known and this leads to querying the data in a content-based fashion, as opposed to the structure-based queries over relational databases. The special features of semistructured data define a particularly interesting domain for query languages. Computations over semistructured data can easily become infinite, even when the underlying alphabet is finite. Query languages for semistructured data have been investigated in the context of algebraic programming [2,10]. In this chapter, a different approach to modeling and querying semistructured and Web data is presented, based on Feature Logics ([44,24,45]) and not on algebraic programming. More specifically, we propose a rule-based constraint language for manipulating semistructured data. The proposed language has both declarative and operational semantics which are based on fix-point theory, as in the typical logic programming [31]. Moreover, the issue of relaxing queries by extending the proposed query language will be investigated. Finally, the chapter presents a special case study for the proposed framework (based on XML data structure). 1.1 Research Review Semistructured and XML data modeling and retrieval earlier research efforts have focused on query languages development, constraint-based approaches, query relaxation and XML specific query languages : Semistructured data is usually modeled as a labeled graph, in which nodes correspond to the data objects and the edges to their attributes. Most query languages proposed for semistructured data navigate the data using Regular Path Expressions, thus traversing arbitrary long paths in the graph. In [2] a query language, called Lorel, for semistructured data is defined by extending [16] with powerful and flexible path expressions, which allow querying with no precise structure knowledge. Path expressions are built from labels and wild-card (place-holders) using regular expressions, allowing the user to specify rich patterns that are matched to actual paths in the database graph. One of the limits of this language is that it does not allow to express recursive queries over the database graphs. In the language presented in [10], UnQL, is loosely related to Lorel, allowing to query data organized as a root, edge-labeled graph. A primary feature of UnQL is a powerful construct called traverse that allows restructuring of trees to arbitrary depth. The language of terms uses variables ranging over trees or over edge labels. A tree is seen as a set of edge/sub-tree pairs. In [17], a more rigidly typed approach is followed in a language called OQL-doc, with the model having strong similarity to OEM (due to the introduction of heterogeneous collections). Optimizing the evaluation of generalized path expressions is considered in [18] where the optimization is based on two object

Upload: others

Post on 31-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

Modeling and Querying Web Data: A Constraint-Based Logic Approach

Evimaria Terzi* Mohand-Saïd Hacid Athena Vakali*

*Department of Computer Sciences Laboratoire dIngénierie des Systèmes dInformation Purdue University 20, avenue Albert Einstein West Lafayette, IN 47907 USA 69621 Villeurbanne France edt, [email protected] [email protected]

1 Introduction New applications and the expanding rate of data being stored and circulated over distributed systems and the Internet has resulted in the development of new database technologies. The need for flexibility and accuracy in data representation and manipulation is vital in Web databases [4,34]. The most important characteristic of these applications is the diversity of the data being stored and the lack of fixed and rigid structure (i.e. schema). The core problem in semistructured data is that the structure is not fully known and this leads to querying the data in a content-based fashion, as opposed to the structure-based queries over relational databases. The special features of semistructured data define a particularly interesting domain for query languages. Computations over semistructured data can easily become infinite, even when the underlying alphabet is finite. Query languages for semistructured data have been investigated in the context of algebraic programming [2,10]. In this chapter, a different approach to modeling and querying semistructured and Web data is presented, based on Feature Logics ([44,24,45]) and not on algebraic programming. More specifically, we propose a rule-based constraint language for manipulating semistructured data. The proposed language has both declarative and operational semantics which are based on fix-point theory, as in the typical logic programming [31]. Moreover, the issue of relaxing queries by extending the proposed query language will be investigated. Finally, the chapter presents a special case study for the proposed framework (based on XML data structure). 1.1 Research Review Semistructured and XML data modeling and retrieval earlier research efforts have focused on query languages development, constraint-based approaches, query relaxation and XML specific query languages : Semistructured data is usually modeled as a labeled graph, in which nodes correspond to

the data objects and the edges to their attributes. Most query languages proposed for semistructured data navigate the data using Regular Path Expressions, thus traversing arbitrary long paths in the graph. In [2] a query language, called Lorel, for semistructured data is defined by extending [16] with powerful and flexible path expressions, which allow querying with no precise structure knowledge. Path expressions are built from labels and wild-card (place-holders) using regular expressions, allowing the user to specify rich patterns that are matched to actual paths in the database graph. One of the limits of this language is that it does not allow to express recursive queries over the database graphs. In the language presented in [10], UnQL, is loosely related to Lorel, allowing to query data organized as a root, edge-labeled graph. A primary feature of UnQL is a powerful construct called traverse that allows restructuring of trees to arbitrary depth. The language of terms uses variables ranging over trees or over edge labels. A tree is seen as a set of edge/sub-tree pairs. In [17], a more rigidly typed approach is followed in a language called OQL-doc, with the model having strong similarity to OEM (due to the introduction of heterogeneous collections). Optimizing the evaluation of generalized path expressions is considered in [18] where the optimization is based on two object

Page 2: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

algebra operators, (one dealing with paths at the schema level and the other with paths at the data level).

In [25] conjunctive queries that allow for incomplete answers in the framework of semistructured data are studied. The proposed model of query evaluation consists of a search phrase (involving search constraints), where a query graph containing variables is used to match a maximal portion of the database graph, and a filter phrase (involving filter constraints) where the maximal matching resulting from the search phrase are subjected to constraints. The authors deliberately limited their investigation to queries that do not allow regular path expressions. Also related to our work are several queries for the World Wide Web (WWW) that have recently emerged. e.e W3QL, which focuses on extensibility, and WebLog which is based on Datalog-like syntax. Additional relevant work includes query languages for hypertext structures, and work on integrating SGML. Documents with relational databases, since SGML documents can be viewed as semistructured data. In relation to path constraints Abiteboul and Vianu [3] investigate the evaluation and optimization of path expression queries involving path constraints. Path constraints are local; they may capture the structural information about a web site (or a collection of sites). A path constraint is an expression of the form p q or p = q where p and q are regular expressions. A path constraint p q holds at a given site if the answer to query p applied to that site is included in the answer to q applied to the same site (and similarly to p = q). The constraint p = q is also allowed to our constraint language.

Constraints such as p q and p Џ q cannot be expressed as a word constraint of [3]. A class of path constraints useful for both structure and semistructured data for specifying natural integrity constraints is proposed by Buneman et al. in [13]. A path constraint is an expression of either the forward form (forward form) or the backward form (backward

form) where p, q, γ are paths. The constraint language cannot express queries like α β and α Џ β.When a database does not contain the necessary information to provide the answer for an imposed query. The way the database management system deals with the issue of incomplete information provided by either the database or the query itself varies. In [15] different types of constraints are considered and discussed in relation to how the expressive power of the constraint language may influence the complexity of checking subsumption between schemas. A new semistructured data model that extends the Βasic Data model For Semistructured data (BDFS) ([12]) with both constraints and incomplete theories is proposed. The problem of queries with incomplete answers over semistructured data has also been studied in [25]. The main goal of this study is to investigate the principles of queries that allow for incomplete answers. Generally, semistructured data models are different from the classical "structured" data models by two characteristics: they do not assume that the data has homogeneous data structure, and they do not assume that data are incomplete. This study mainly focuses on the second characteristic, and according to it a congenial query language for incomplete data must allow incomplete answers as query results. Some theoretical principles of such languages are also presented.

XML [9] is a new standard adopted by the World Wide Web Consortium (W3C) to complement HTML for data exchange on the Web. It is a data format for Web applications. XML documents do not have to be created and used with respect to a fixed, existing schema. This is particularly useful in Web applications, for simplifying exchange of documents and for dealing with semistructured data. Its emergence as a standard for data representation on the Web is expected greatly to facilitate the publication of electronic data by providing a simple syntax for data that is both human and machine-readable. Regarding querying over XML documents, in [29] a notion of subsumption to capture the relationship between XML types is proposed. Intuitively, subsumption captures not just the fact that one type is contained into another, but also captures some of the structural relationships between the two schemas. In this work is shown that subsumption can be used to facilitate commonly used typed-related operations on XML

Page 3: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied in [5]. In this study a very simple model for XML and their DTDs is considered and a representation system for incomplete information is developed. The incomplete information about an XML document is continuously enriched by successive queries to the document. The proposed representation system can represent partial information about the source document acquired by successive queries, and that it can be used to intelligently answer new queries. In [42] the problem of answering queries imposed over XML document collections is reduced to the well-known unordered tree-inclusion problem, which is additionally extended to an optimization problem by applying a cost model to the embeddings. Finally a framework for querying XML databases by specifying ordering constraints over the documents is proposed in [23].

1.2 Contribution Overall the framework presented in this chapter integrates formalisms of Feature Logics, Constraint Logic Programming and Databases and its main contributions are summarized next :

(1) Development of a simple and flexible structure for representing semistructured data. The structure is called role trees and is inspired by Feature Constraint Systems. Trees are generally useful for structuring data in modern applications, and this empowers our data structure with even more interesting potential.

(2) Definition of two types of constraints imposed over semistructured data. The ordering constraints allow a declarative relationships specification between trees (representing semistructured data) while path constraints allow to constrain the navigation of the trees. In our case the constraints are of different expressiveness.

(3) Proposal of a declarative, rule-based, constraint query language that will be used to infer relationships between semistructured data. We view our query language as consisting of two constraint languages on top of which relations can be defined by definite clauses. The language has declarative and operational semantics.

(4) Introduction of a way for approximating answers, a task that requires a refinement of the query evaluation model. In fact, rules allowing constraint relaxation will be proposed and their functionality will be proven.

(5) Application of the proposed framework on XML data (i.e. a special format of semistructured data) under assumptions and specializations in both the generic data model and query language.

Chapter Outline: Section 1 has presented an overview of research efforts in the area of semistructured and XML data modeling as well as in the area of query formulation and processing in semistructured databases. The basic notions of constraints are given in Section 2. Section 3 defines the features of semistructured data, the structure for representing them as well as the constraints that allow for reasoning over these structures. In Section 4 we propose a query language, provide its syntax and semantics, whereas Section 5 refers to ordering and path constraint extension as well as in query relaxation. Section 6 presents a case study of the proposed framework under the XML data format and conclusions are given in Section 7. 2 Constraints : Preliminaries A constraint is some piece of syntax constraining the values of the variables occurring in it, and therefore, is a restriction in the space of possibilities. Mathematical constraints are precisely specifiable relations among several unknowns (or variables), each taking a value in a given domain. Constraints restrict the possible values that variables can take, representing some (partial) information about variables of interest. Some of the interesting attributes of constraints are the following:

Page 4: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

Constraints may specify partial information - a constraint may not uniquely specify the value of its variables

The constraints are additive: given a constraint c1 say X + Y ≥ Z, another constraint can be added, say, X + Y ≤ Z. The order of imposition of constraints does not matter; all that matters at the end is that the conjunctions of the constraints is in effect.

Constraints are rarely independent; for instance, once c1 and c2 are imposed it is the case that the constraint X + Y = Z is entailed.

Constraints are nondirectional: typically a constraint on (say) three variables X, Y, Z can be used to infer a constraint on X given constraints on Y and Z or a constraint on Y given constraints on X and Z and so on.

The constraints are declarative: typically they specify what relationship must hold without specifying computational procedure to enforce that relationship.

Any computational system dealing with constraints should fundamentally take these properties into consideration. A constraint language is built on top of a constraint system and is defined by a tuple (VAR, CON, V, INT) such that: 1) VAR is an infinite set whose elements are called variables 2) CON is a set whose elements are called constraints 3) V is a computable function that assigns to every constraint a finite set V of variables,

called the variables constraint by 4) INT is a nonempty set of interpretations, where every interpretation I INT consists

of a nonempty set DI, called the domain of I, and a solution mapping [.] such that: (a) an I-assignment is a mapping VAR → DI, and ASSI is the set of all I-assignments. (b) [.]I is a function mapping every constraint to a set []I of I-assignments, where the I-assignments in []I are called solutions of in I. (c) a constraint contains only the variables in V, that is, if α []I and β is an I-assignment that agrees with α on V, then β []I .

Predicate logic is a prominent example of constraint language. The well-formed formulas are the constraints, V are the free variables of a formula , and for every Tarski interpretation I the solutions []I are the I-assignments satisfying . Viewing predicate logic as a constraint language abstracts away from the synthetic details of formulas. A constraint is satisfiable if there exists at least one interpretation I in which has a solution. A constraint is valid in an interpretation I if []I = ASSI , that is, every I-assignment is a solution of √ in I. An interpretation is a model of a set Φ of constraints if it satisfies every constraint in Φ. A renaming is a bijection VAR → VAR that is the identity except for finitely many exceptions. If ρ is a renaming, we cal ΄ a constant a ρ-variant of a constraint if V΄ = ρ(V) and []I = [΄]I ρ: = αρ|α [΄]I for every interpretation I. A constraint ΄ is called a variant of a constraint if there exists a renaming ρ such that ΄ is a ρ-variant of . Proposition 1 A constraint is satisfiable if and only if each of its variables is satisfiable. Furthermore, a constraint is valid in an interpretation I if and only if each of its variants is valid in I. A constraint language is closed under renaming if every constraint has a ρ-variant for every renaming. A constraint language is closed under intersection if for every two

Page 5: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

constraints and ΄ there exists a constraint ψ such that []I ∩ [΄]I for every interpretation I. A constraint language is decidable if the satisfiability of its constraints is decidable. Let Φ be a set of constraints and I be an interpretation. The solutions of Φ in I are defined as: [Φ]I : = []

I

where [Φ]I : = if Φ is empty. Now, we introduce a general schema for handling clauses whose variables are constrained with an underlying constraint theory. In general, constraints can be seen as quantifier restrictions as they filter out the values that can be assigned to the variables of the clause (or an arbitrary formulae with restricted universal or existential quantifier) in any of the models of the constraint theory. We give a resolution principle for clauses with constraints, where unification is replaced by testing constraints for satisfiability over the constraint theory. The Resolution Principle is based over the following inference rule: From (A B) and (A C) infer (B C)

The rule is obviously correct: If (A B) and ( A C) are true, then the resolvent (B C) is also true. Since either A is false, then B must be true, or A is true, then C must be true and hence in any case (B C) must be true. For predicate logic the two complementary A's are atoms starting with the same predicate symbol, but with potentially different argument terms, say s

i and ti (i [1,n]). Then one has to unify corresponding arguments of the literals A and A -that is to find substitutions for the variables that make the corresponding argument

terms identical- before the conclusion can be drawn, and the resolvent has to be instantiated with the unifying substitution:

C

CttPBandss nn

),...,(),...,( 11 if σ si = σ ti ni 1

Robinson [40] showed that this rule (combined with factorization) provides a complete calculus. An analysis of the soundness and the completeness of the proof, however, shows that it is not necessary to unify the terms. It would be enough to test whether they are unifiable, provided we add a constraint Γ consisting of term equations si = ti (saying "if the terms can be made equal") to the inferred resolvent. Whenever such constrained resolvents are now involved in a further resolution step the new resolvent inherits their constraints together with the new argument term equations. Now, these collected constraints have to be tested for unifiability. Thus a resolution step in this modification takes two clauses with such equational constraints and produces a resolvent with a new unifiable equational constraint consisting of the constraints inherited from its parents together with the argument equations involved complementary literals.

ii

nn

tsCBCttPandBssP

),...,(),...,( 11 is unifiable ii ts

A more general view is that every clause might have some arbitrary, not necessarily equational constraint and a resolvent of two clauses gets a new constraint that is unsatisfiable whenever one of the constraints of the parents (or the equational constraint of the arguments of the complementary literals) is unifiable.

Page 6: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

ii

nn

tsSRCBSCttPRandBssP

),...,(),...,( 11 if is satisfiable ii tsSR

The constraint resolution principle is again sound, but as in classical case completeness of a constraint resolution calculus is not straightforward. 3 Semistructured Data : Data Model and Queries In traditional databases (e.g. the relational model [19]) there is a clear separation between the schema and the data itself. Recently it has been recognized that there are applications where the data is self-describing in the sense that it does not come with a separate schema, and the structure of the data (when it exists) has to be inferred from the data itself. Such data is called semistructured. A frequent scenario for semistructured data is when data is integrated (in a simple fashion) from several heterogeneous sources and there are discrepancies among the various data representations. For example, some information may be missing in some sources, an attribute may be single-valued in one source and multi-valued in another, or the same entity may be represented by different types in different sources. Overall, the semistructured data have the following properties:

The Structure is irregular: In many applications, the large collections that are maintained often consist of heterogeneous elements. Some of these elements may be incomplete or others may record additional information (e.g. annotations). Moreover, different types may be used for the same kind of information. For example, the price of some items may be expressed in dollars in a portion of the database and in francs in another portion. Modeling and querying such irregular structures is rather a complicated issue.

The structure is implicit: In many applications, although there is a precise underlying structure, this structure is given implicitly. For instance, electronic documents consist of a text and a grammar (i.e a DTD in SGML). The parsing of the document then allows one to isolate pieces of information and detect relationships between them. However, the interpretation of these relationships may be beyond the capabilities of standard database models and are left to the particular applications and specific tools.

Management of semistructured data requires typical database features such as languages for forming adhoc queries and updates, concurrency control, secondary storage management, etc. However, because semistructured data cannot conform to a standard database framework, trying to use conventional DBMS to manage semistructured data becomes difficult or impossible task. When querying semistructured data one cannot expect the user to be fully aware of the complete structure especially if the structure evolves dynamically. Thus, it is important not to require full knowledge of the structure to express meaningful queries. At the same time we do not want to be able to exploit regular structure during query processing when it happens to exist and the user happens to know it. An example of a complete database management system for semistructured data is LORE [32], a repository for OEM [46] data featuring the LOREL [2] query language. Another system devoted to semistructured data is the Strudel Web site management system, which features the StruQL query language [21] and a data model similar to OEM. UnQL [10] is a query language that allows queries on both the content and the structure of a semistructured database and also uses a data model similar to Strudel and OEM. eXtensible Markup Language (XML) is an emerging standard for web data, and bears a close correspondence to semistructured data models introduced in many research efforts. Semistructured data is naturally modeled in terms of graphs with labels that provide the semantics to the underlying structure [1, 10, 11]. Informally, the vertices in such graphs represent objects and the labels on the edges convey semantic information about the

Page 7: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

relationship between the objects. The vertices without outgoing edges (sink nodes) in the graph represent complex objects. Path expressions can describe the path along the graph, and can be viewed as compositions of labels. Here, we assume that the usual types (like String, Integer, Real, etc.) are available. Additionally, we add a new type called Feature for labels that would correspond to an attribute (i.e. label) names. We write numbers and labels literally, and use quotation marks for strings, e.g., car. In what follows we make the simplifying assumption that labels can be symbols, strings, integers, etc; in fact the type of the labels is just the discriminated union of these base types. The graph models that are used to represent semistructured data can formally be defined as follows: The semistructured Graph-Model represents semistructured data by a directed labeled graph G = (N, E) where N is a finite set of labeled nodes and E is a finite set of labeled edges. An edge e is written as (n1, α, n2) where n1 and n2 are members of N and α is the label of the edge. A specific class of directed graphs -the trees- will be used as individual units of our constraint system. There are two types of trees that can be used to represent semistructured data, based on the special attributes they have. These trees are the feature trees and the role trees. A feature tree is a tree with unordered, labeled edges and labeled nodes. The edge labels are called features. The features are functional in that two features labeling edges departing from the same node are distinct. This means that the feature trees are deterministic structures. In terms of programming, features correspond to record field selectors and node labels to record field contents.

In theothdolabsemfinsemdatheof A rol

Figure 1 Example of semistructured data represented as a many levels role tree

our framework we extend the notion of feature trees to that of the role trees. In the case of role trees two features labeling edges departing from the same node may not be distinct, in er words role trees are non-deterministic data structures. A role tree is defined by its tree

main and its labeling function. The domain of a role tree τ is the multi-set of all words eling a branch from the root of τ to a node of τ. In our case, we will use for the istructured data representation. It is obvious that a role tree is finite if its tree domain is

ite and in general the domain of a role tree can also be infinite in order to model istructured data with cyclic dependancies. A graphical example of how semistructured

ta can be represented by a role tree is given in Figure 1 (the non-deterministic features of role tree representing a literature books' library). The labels that correspond to the features the semistructured data are put on the edges of the tree. role tree can be viewed as an information carrier and this rise an ordering relation over e trees which will be called information ordering.

Page 8: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

Intuitively, a role tree τ1 More precisely, this meandomain of τ2 and that theτ2. In this case we write the right one since it coFrench language. Query Examples: The fvariables (i.e., variablesranging composition of rthe database. The formal aannss This query returns the seα) from the root (of eachthis query (expressed aJJoohhnn((αα,, xx)) is called thein the tree xx leading from aannsswweerr((xxThis query returns a set (valuation of a) in xx andJJoohhnn)).. This query returns pairs Two trees xx and yy are subsumed by zz. In other w 4 Constraint Lang Our query language invdefined by definite clause 4.1 Data Model and CTo give a rigorous formwhose symbols are callesets of sorts, and the lett

Figure 2 Ordering Relation over Role trees

is smaller than a role tree τ2 if τ1 has fewer edges and labels than τ2. s that every word of roles in the tree domain of τ1 belongs to the tree

partial labeling function of τ1 is contained in the labeling function of τ1 τ2. For example, the left tree shown in Figure 2 is smaller than nveys no information that the specific entry is a book written in

ollowing are examples of queries. In these queries, x and y are tree ranging over trees), and α, β, are path variables (i.e., variables oles). We use the predicate symbol ttrreeee to denote the set of trees in semantics of the constructs used in constraints will be given later.

wweerr((xx)) ttrreeee((xx)) JJoohhnn((αα,, xx))

t of trees such that there is a path (here the valuation of the variable tree answer to the query), leading to the sort (i.e., value) JJoohhnn.. In s a rule), aannsswweerr((xx)) is called the head of the query and ttrreeee((xx)),, body of the query. The notation S((αα,, xx)) means that there is a path α the root to the set of sorts S. ,, yy)) ttrreeee((xx)),, ttrreeee((yy)) yx ,, JJoohhnn((αα,, xx)),,JJoohhnn((ββ,, yy)) of pairs ((xx,, yy)) of trees such that xx subsumes yy and there is a path a path (valuation of β) in yy leading to the same set of sorts (here

aannsswweerr((xx,, yy)) ttrreeee((xx)),, ttrreeee((yy)) x ~~ yy of trees that are compatible. The symbol ~~ stands for compatibility. compatible if there is another three zz such that xx and yy are both ords, xx ~~ yy )( zyzxz ..

uages for Semistructured Data

olves two constraint languages on top of which relations can be s.

onstraints alization of role trees, we first fix two disjoint alphabets S and F, d sorts and roles, respectively. The letters S, S΄ will always denote ers f, g, h will always denote roles. Words over F are called paths.

Page 9: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

The concatenation of two paths υ and w results in the path υw. The symbol ε denotes the empty path, υε = ευ = υ, and F* denotes the set of all paths. A tree domain is a nonempty set D F* that is prefix-closed; that is, if υwD, then υ D. Thus, it always contains the empty path.

A role tree is a mapping t: D P(S) from a tree domain D into the powerset P(S) of S. The paths in the domain of a role tree represent the nodes of the tree; the empty path represents its root. The letters s and t are used to denote role trees.

When convenient, we may consider a role tree t as a relation, i.e., t F* P(S), and write (w, S) t instead of t(w) = S. (Clearly, a relation t F* P(S), is a role tree if and only if D = w |S : (w, S) t is a tree domain and t is relational). As relations, i.e., as subsets of F* P(S), role trees are partially ordered by set inclusion. We say that s is smaller than (or, is a prefix-subtree of; or, subsumes; or, approximates) t if s t.

The subtree wt of a role tree t at one of its nodes w is the role tree defined by (as a relation) wt : = (v, S) | (wv, S) t If D is the domain of t, then the domain of wt is the set w-1 D = v | wv D. Thus, wt is given as the mapping wt : w-1 P(S) defined on its domain by wv(t) = t(wv). A role tree s is called a subtree of a role tree t if it is a subtree s = wt at one of its nodes w, and a direct subtree if w

F.

A role tree t with domain D is called rational if (1) t has only finitely many subtrees and (2) t is finitely branching; that is: for every w D, wF D = wf D | f F is finite. Assuming (1), the condition (2) is equivalent to saying that there exist finitely many roles f

1,,fn such that D f 1,,fn *. Constraints over role trees will be defined as first-order formulae. 4.2 Ordering Constraints In the following, we introduce the syntax and semantics of ordering constraints over role trees. We assume an infinite set (which we denote by V) of tree variables ranged over by x, y,

an infinite set (which we denote by V ) of path variables ranged over by α, β, an infinite set F of roles

~

1 ranged over by f, g, and an arbitrary set S of sets of sorts denoted by S, T containing at least two distinct elements. 4.2.1 Syntax and Semantics Syntax. An ordering constraint φ is defined by the following abstract syntax. φ : : = x y | S(α, x) | x[f]y | x ~ y | φ 1 φ 2 An ordering constraint is a conjunction of atomic constraints which are either atomic ordering constraints x y, generalized labeling constraints S(α, x), selection constraints x[f]y, or compatibility constraints x ~ y.

Semantics. The signature of the structure contains the binary relation symbols ≤, ~, and S(,) (for every set of labels S), and for every role f a binary relation symbol [f]. The domain of the structure R is the set of possibly infinite role trees. The relation symbols are interpreted as follows: τ1 τ2 iff D

D

and L

L

1 2 1

2

1 In our case, a role is either subelement or value.

Page 10: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

τ1 [f] τ2 iff D = p | f p D

and L

= (p, S) | (f p, S ) L

2 1 2 1

S (α, τ) iff µ(α)Dτ and (µ(α), S) Lτ τ1 ~ τ2 iff L

L

is a partial function (on D

D

) 1 2 1

2

where µ is a valuation from V to the set of elements F*. ~

4.2.2 Satisfiability Test A set of axioms (valid for our constraint system) and their interpretation as an algorithm that solves the satisfiability problem of our constraint system is presented next : Table 1 has the axioms schemes F1 - F6 whereas the union of these sets of axioms is denoted by F. For instance, an axiom scheme x x represents the infinite set of axioms obtained by instantiation of the meta variable x. An axiom is either a constraint φ, an implication between constraints φ φ΄ or an implication φ false.

F1.1 x x F1.2 x y y z x z F2 x[f]x' x y y[f]y' x' y' F3.1 x x F3.2 x y y z x z F3.3 x y y x F4 x[f]x' x y y[f]y' x' y' F5 S (, x) S '(, x) false for S S ' F6 S (, x) S '(, y) x y false for S S

Table 1: Axioms of Satisfiability: F1-F6

The role tree structure T is defined as follows:

The domain of T is the set of all role trees. tAT if and only if t(ε) = A (t's root is labeled with the sort A). (s, t) f T if and only if f Ds and t = f s (t is the subtree of s at f).

Proposition 2 The structure T is a model of the axioms in F. Proof We prove the statement for the rules in F5 and F6.

F5) S(α, x) S ΄(α, x) iff p D x and (p, S) L x and (p, S΄) L x. Or p is a path (a finite sequence of labels). This means that we cannot have simultaneously (p, S) Lx and (p, S ΄) Lx. F6) S(α, x) S ΄(α, y) x~y S(α, x) S΄(α, y) z(x z y z)

S(α, x) S(α, z) S ΄(α, y) S ΄(α, z) false (according to F5, i.e.,S(α, x) S ΄(α, x) false)

4.3 Path Constraints-Syntax and Semantics Some additional definitions will enlight the syntax and semantics of the proposed query language : Definition 1 (Path) A path is a finite string of labels. We identify a label (i.e., a role name in our case) f with the string (f) consisting of a single label. We say that a path u is a prefix of a path v (written u<v) if there is a non-empty path w such that v=u. Note that < is neither symmetric nor reflexive. We say that two paths u, v diverge (written u Џ v) if there are labels

Page 11: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

f, g with f g, and possible empty paths w, w 1, w2, such that u = w.f.w1 v = w.g.w 2. It is clear that Џ is a symmetric relation. Proposition 3 Given two paths u and v, then exactly one of the relations u = v, u < v, u > v or u Џ v holds. Definition 2 (Path Term) A path term, denoted p, q,..., is either a path variable α or a

concatenation of path variables α .

Syntax and Semantics The language of terms uses a countable set V of variables called path variables, and denoted a,β... Definition 3 (Path Constraint) The set of atomic constraints is given by:

c prefix

p L path restriction

path equality α Џ β divergence

L is a regular expression denoting a regular language L (L) F+, where F is a set of roles (or labels). An interpretation I is a standard first order structure, where every label f F is interpreted

as a binary relation fI. A valuation νV is a function νV : V F~ ~~

+. We define νV (a·β) to be νV (α) νV (β).

~

~ ~

The validity of an atomic constraint is an interpretation I under a valuation (νV , νV ) is defined as follows:

~

νV |= I prefix ~

νV |= I p L path restriction ~

νV |= I path equality ~

νV |= I α Џ β divergence ~

A constraint is satisfiable if there exists at least one interpretation in which has a solution. Satisfiability of conjunctions of our atomic constraints is decidable [6]. 5 Ordering and Path Constraint Extension Query Relaxation Given the constraint languages C (for constraints over role trees) and C' (for path constraints), a set of relation symbols, extends C and C' to a constraint language R(C, C'). We define the predicate symbol tree to denote semistructured data represented as trees. We reason about semistructured data by a program P which contains a set of rules defining ordinary predicates. The rules are of the form: H ( ) L1 ( 1 ),, Ln ( n )||c1,, cm

for some n 0 and m 0, where , 1 ,, n are tuples of tree variables or path

variables. We require that the rules are safe, i.e., a variable that appears in must also appear in 1 ... n path variables and tree variables occurring in c1,, cm. The predicates L1,, Ln may be either ttrreeee or ordinary predicates. c1,, cm are ordering

Page 12: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

constraints (C-ccoonnssttrraaiinnttss) or path constraints (C'-ccoonnssttrraaiinnttss). In the following, we use the term (positive) atom to make reference to predicates L1,, Ln. 5.1 Semantics Our language has a declarative model-theoretic and fixpoint semantics. The language of terms uses three countable, pair-wise disjoint sets: 1 A set D that is the union of two pair-wise disjoint sets:

D1: a set of tree domains D2: a set of roles

2 A set V of variables called tree variables, and denoted x, y,

3 A set V of variables called path variables, and denoted α, β, ~

Model-theoretic semantics

V denotes a set of variables called tree variables, and V denotes a set of variables called path

variables. Let V = V . Let var

~

^ ~V 1 be a countable function that assigns to each syntactical

expression a subset of V corresponding to the set of tree variables occurring in the expression,

and var2 be a countable function that assigns to each expression a subset of V corresponding to the set of path variables occurring in the expression.

~

Let var = var1 var 2. If E1,, En are syntactic expressions, then var(E1,, En) is an abbreviation for var(E1) var(Ε... n). A ground atom A is an atom for which var(A) = ø. A ground rule is a rule r for which var(r) = ø. Definition 4 (Extension) Given a set D2 of labels, the extension of D2, written D , is the set of path expressions containing the following elements:

ext2

each element in D2 for each ordered pair $p_1, p_2$ of elements of D ,the element pext

2 1 · p2 Definition 5 (Extended Active Domain) The active domain of an interpretation I, noted DI

is the set of elements appearing in I, that is, a subset of D1 D2. The extended active domain of I, denoted D , is the extension of Dext

I I, that is, a subset of D1 D . ext2

Definition 6 (Interpretation) Given a program P, an interpretation I of P$ consists of:

A domain D A mapping from each constant symbol in P to an element of domain D A mapping from each n-ary predicate symbol in P to a relation in (Dext)n

Definition 7 (Valuation) A valuation ν1 is a total function from V to the set of elements D1. A

valuation ν2 is a total function from V to the set of elements D . Let ν = ν~

ext2 1 ν 2. ν is

extended to be identity on D and then extended to map free tuples to tuples in a natural fashion.

Page 13: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

Definition 8 (Atom Satisfaction) Let I be an interpretation. A ground atom L is satisfiable in I if L is present in I. Definition 9 (Rule Satisfaction) Let r be a rule of the form: r : A L1,, Ln || c1,, cm where L1,, Ln are (positive) atoms, and c1,, cm are constraints. Let I be an interpretation, and ν be a valuation that maps all variables of r to elements of D ext

I . The rule r is said to be true (or satisfied) in interpretation I for valuation ν if ν [A] is present in I whenever each ν [Li], i [1, n] is satisfiable in I, and each ν [cj], j [1, m] is satisfiable. 5.1.1 Fixpoint Semantics The fixpoint semantics is defined in terms of an immediate consequence operator, TP, that maps interpretations to interpretations. An interpretation of a program is any subset of all ground atomic formulas built from predicate symbols in the language and elements in Dext. Each application of the operator TP may create new atoms. We show below that TP is monotonic and continuous. Hence, it has a least fixpoint that can be computed in a bottom-up terative fashion. Recall that the language of terms has two countable disjoint sets: a set of tree domains (D1), and a set of roles (D2). A path expression is an element of D . We define Dext

2ext = D1 D . ext

2

At the iteration 1, only paths of length2 1 (i.e., simple roles, elements of D2) are considered during rules triggering. At iteration k, we consider paths of length less or equal to k. It is clear that the paths occurring in the final result are of length less or equal to the longest path that can be found in all the tree domains. Lemma 1 If I1 and I2 are two interpretations such that I1 I2, then . ext

IextI DD

21

Definition 10 (Immediate Consequence Operator) Let P be a program and I an interpretation. A ground atom A is an immediate consequence for I and P if either A I, or there exists a rule r : H L 1,, Ln || c1,, cm in P, and there exists a valuation ν, based on D , such that: ext

I

A = ν (H), and is satisfiable, and )(],,1[ iLni

ν (c1,, cm) satisfiable.

Definition 11 (T-Operator) The operator TP associated with program P maps interpretations to interpretations. If I is an interpretation, then TP(I) is the following interpretation: TP(I) = I A | A is an immediate consequence for I and P

2 The length of a path is the number of roles composing the path.

Page 14: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

Lemma 2 (Monotonicity) The operator TP is monotonic; i.e., if I1 and I2 are two interpretations such that I1 I 2, then TP(I1) TP(I2) Proof Let I1 and I2 be two interpretations such that I1 I 2. We must show that if an atom A is an immediate consequence for I1 and P, then A TP(I2). Since A is an immediate consequence for I1 and P, at least one of the following cases applies:

A I1. Then A I2, and thus A TP(I2); there exists a rule r : H L 1,, Ln || c1,, cm in P and a valuation ν, based

on D , such that A = ν (H), [1, n] ν(LextI1

i

i

i) is satisfiable, and ν (c1,, cm)

satisfiable. Following the Lemma 1, ν is also a valuation based on D . Since IextI2 1

I2, we have ν(Li) satisfiable [1, n], and ν(c 1,, cm) satisfiable. Hence A T P(I2).

Theorem 1 (Continuity) The operator TP is continuous, that is, if I1, I2, I3, are interpretations such that I1 I 2 I 3 (possibly infinite sequence), then TP )()( iPiii ITI . Proof Let I = and let A be an atom in Tii I P(I). We must show that A is also in

. )( iPi IT

At least one of the following two cases applies: A I , i.e., A . Then, there exists some j such that A Iii I

i T

j. Thus, ATP(Ij) and consequently A . )( iP I

There exists a rule r: H L1,, Ln|| c1,, cm in P and a valuation ν based on D

such that [1, n] ν(L

extI

i

)( iI

i) is satisfiable and ν(c1,, cm) satisfiable. Since ν(Li) satisfiable, there exists some ji such that ν(Li) satisfiable in I . In addition, since the

Iij

k are increasing, there exists some l, such that I Iij

extI

l for all ji. Hence, ν(Li) satisfiable in Il i [1, n] and ν(c1,, cm) satisfiable. Let V = var(L1,, Ln) be the set of variables in the rule r, and let ν(V) be the result of applying ν to each variable in V. ν(V) is a finite subset of D since ν is based on D . We have ν(Lext

I i) satisfiable

[1, n] and ν(ci

1,, cm) satisfiable. Thus, ν(var(Li)) satisfiable in D [1, n] and ν(c

extI1

i

1,, cm) satisfiable. Then ATP(Il) (A = ν(H)). Consequently A . Pi T

Page 15: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

Lemma I is a model of P iff TP(I) I.

Proof " " If I is an interpretation and P a program, then let cons(P, I) denote the set of all ground facts which are immediate consequences for I and P. TP(I) = I A | A is an immediate consequence for I and P

For any element A in cons(P, I), at least one of the following cases holds:

A I. By definition of immediate consequence; there exists a rule r : H L 1,, Ln||c1,, cm in P, and a valuation ν such that

[1, n] ν(Li i) satisfiable in I, ν(c1,, cm) satisfiable, and A = ν(H). Since I is a model of P, I satisfies r (I |= r), and then A I. Thus, TP(I) I.

" " Let I be an interpretation and P be a program.

Let r: H L 1,, Ln||c1,, cm be any rule in P and ν any valuation. If [1, n] ν(Li i) satisfiable in I and ν(c1,, cm) satisfiable in I, then ν(H) TP(I). Because TP(I) I, we have ν(H) I, and then I satisfies r (I |= r). Hence I |= P.

Lemma 4 Each fixpoint of TP is a model for P. Proof Follows immediately from Lemma 3. Theorem Let P be a program and I an input such that the minimal model for P exists, then the minimal model and the least fixpoint coincide. Proof Let P be a program and I an interpretation. Let us denote by P(I) the minimal model of P containing I. According to lemma 3, TP(P(I)) P(I). TP is monotonic, so TP(TP(P(I))) TP(P(I)), and then TP(P(I)) is a model of P containing I. As P(I) is the minimal model containing I, we have P(I) T P(P(I)). As P(I) is a fixpoint of P and also a minimal model of P, each fixpoint of TP containing I is a model of P containing P(I). Thus P(I) is the minimal model of P containing I. 5.2 Query Relaxation When optimal solutions (i.e., those data trees matching the query tree) cannot be obtained, one may be interested in finding sub-optimal solutions by relaxing some constraints. This section shows how this can be achieved in the context of our query language. More specifically, according to the semantics of our query language, the satisfiability test for the constraint part of queries assumes ordered inclusion of trees, leading to exact matching of paths and sorts. For the need of approximate query answers, we have to deal with unordered inclusion of trees, and set constraints. The tree, which can be a query tree, on the left of Figure 3 is not perfectly embedded in the data tree on the right. The member node in the data tree is skipped. Definition 12 (Embedding) A function f from the set Q of query nodes into the set D of data nodes is called an embedding if for all qi, qj Q: 1. f(qi) = f(qj) q i = qj

label(qi) = label(f(qi))

Page 16: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

2. qi is the parent of qj with e the label of the edge from qi to qj f(q i) is an ancestor of f(qj) with e the label of an edge from f(qi) to f(qj)

Let p = α1 αn be a path. We write αi p to denote the fact that αi is a component of p. We write αi <p aj when i < j. That is, αi is an ancestor of αj in p. Let p = α1 αn and q = b1 bm be two paths. We say that p is embedded in q, and we write p q iff

1. n m 2. α i p, ai q (i [1, n])

3. a i, aj p, ai <p aj αi <q aj

Let Q be a query of tconstraint of the formexact answer, then reX is a new set variabl X S (also called squery, under the sem

To summarize, let L

the constrainby X(, x), X

the constrainnew path var

Figure 3: Query Relaxation for Semistructured Data

he form L C where C is the constraint part. Assume that C contains a S(p, x), where S is a set and p is a path. In the case Q doesn't return an

write the constraint C by replacing S(p, x) by X(α, x), X S, α p, where e, and α is a new path variable.

et constraint) and α p are weak constraints. The evaluation of the new antics of and will return a set of approximate answers.

C be a query, where C is its constraint part. Then, If C contains:

t S(, x), where stands for a path or a path variable, then replace S(, x) S, with X a new set variable.

t (p, x), where p is a path, then replace (p, x) by (α, x), α p, with α a iable.

Page 17: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

6 Case Study: The XML Data Format XML has a widespread acceptance in practical applications of databases. It is acknowledged for its simplicity and its expressibility. As such, the field of semistructured databases has proven to be an important research platform and in many ways setting the standard for future database technology. In order to meet the demands for more expressive and flexible ways to interact with XML data, it is important to go beyond what can be formalized by traditional tools. 6.1 XML - Peliminaries XML3 is a textual representation of data. The basic component in XML is the element, that is, a piece of text bounded by matching tags such as < book > and < /book >. Inside an element we may have "raw" text, other elements, or a mixture of the two. Consider the following XML example: << bbooookk >>

<< ttiittllee >> TThhee PPhhaannttoomm ooff tthhee OOppeerraa << //ttiittllee >> << aauutthhoorr >> GGaassttoonn LLeerroouukk << //aauutthhoorr >> << llaanngguuaaggee >> FFrreenncchh << //llaanngguuaaggee >>

<< //bbooookk >> An expression such as < book > is called a start-tag and << //bbooookk >> an end-tag. Start- and end-tags are also called markups. Such tags must be balanced; that is, they should be closed in inverse order to that in which they are opened, like parentheses. Tags in XML are defined by users; there are no predefined tags, as in HTML. The text between a start-tag and the corresponding end-tag, including the embedded tags, is called an element, and the structures between the tags are referred to as the content. The term subelement is also used to describe the relation between an element and its component elements. Thus << ttiittllee >> << //ttiittllee >> is a subelement of << bbooookk >> << //bbooookk >> in the example above. As with semistructured data, we may use repeated elements with the same tag to represent collections. The following is an example in which several << bbooookk >> tags occur next to each other. << lliitteerraattuurree >>

<< bbooookk >> << ttiittllee >> TThhee PPhhaannttoomm ooff tthhee OOppeerraa << //ttiittllee >> << aauutthhoorr >> GGaassttoonn LLeerroouukk << //aauutthhoorr >> << llaanngguuaaggee >> FFrreenncchh << //llaanngguuaaggee >>

<< //bbooookk >> << bbooookk >>

<< ttiittllee >> DDaavviidd CCooppppeerrffiieelldd << //ttiittllee >> << aauutthhoorr >> CChhaarrlleess DDiicckkeennss << //aauutthhoorr >> << llaanngguuaaggee >> EEnngglliisshh << //llaanngguuaaggee >>

<< //bbooookk >> << //lliitteerraattuurree >> The basic XML syntax is perfectly suited for describing semistructured data. Recall the syntax for semistructured data expressions. The simple XML document

Page 18: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

<< bbooookk >>

<< ttiittllee >> TThhee PPhhaannttoomm ooff tthhee OOppeerraa << //ttiittllee >> << aauutthhoorr >> GGaassttoonn LLeerroouukk << //aauutthhoorr >> << llaanngguuaaggee >> FFrreenncchh << //llaanngguuaaggee >>

<< //bbooookk >> has the following representation as a semistructured data expression: bbooookk:: ttiittllee:: ""TThhee PPhhaannttoomm ooff tthhee OOppeerraa"",, aauutthhoorr:: GGaassttoonn LLeerroouukk,, llaanngguuaaggee:: FFrreenncchh There is a subtle distinction between an XML element and a semistructured data expression. A semistructured data expression is a set of label/sub-tree pairs, while an element has just one top-level label. XML denotes graphs with labels on nodes4. 6.2 XML data representation and Query Relaxation In this chapter, we consider that an XML document is represented by a role tree of a specific form. The only two edge labels are ssuubbeelleemmeenntt and vvaalluuee. Figure 4 illustrates our representation for the XML data above. Path expressions describe paths along the graph, and can be viewed as compositions of labels. For example, the expression ssuubbeelleemmeenntt..vvaalluuee continues to the subelement of that object, and ends in the value of that subelement. The representation of XML data using this type of role trees is shown in Figure 4.

Figure 4: XML Data Representation The query language built for semistructured data and the way relaxation was dealt within the context of the proposed query language are fully adaptable for XML documents represented by the special type of role trees described above. An example of XML role tree embedding is shown in Figure 5.

Page 19: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

Figure 5: Query Relaxation for XML Data 7 Conclusions Semistructured databases are an area of increasing research value and interest. As the amount of data available on the Web increases rapidly, frameworks and tools that facilitate browsing and filtering become increasingly important for interacting with such exponentially growing information resources and for dealing with access problems. This chapter has presented a representation model for semistructured data (called role trees) that captures the special features of the data. Moreover, two types of constraints have been presented, namely ordering and path constraints. On top of these two constraint systems a declarative query language has been developed. Issues related to query relaxation, and answers provided by the database system when an accurate answer to the query imposed cannot be provided have also been investigated. Finally, the proposed framework has been applied on a special test case of semistructured data as formed by XML documents. Acknowledgements The authors thank Dimitra Terzi for her contribution in formatting the paper and her suggestions on the papers layout, which have improved the papers presentation and readability. References

[1] S. Abiteboul: Querying Semi-Structured Data. In Rroceedings of the International Conference of Database Theory (ICDT '97), Delphi Greece, pages 1-18, Janvier 1997.

[2] S. Abiteboul, D. Quass, J. McHugh, J. Widom and J. L. Wiener: The Lorel Query Language for Semistructured Data, International Journal on Digital Libraries, Vol(1), No.1 pp. 66-88, 1997.

Page 20: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

[3] S. Abiteboul and V. Vianu: Regular Path Queries with Constraints, Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Databases (PODS'97), ACM Press, pp.122-133 Tucson, Arizona, May 1997.

[4] S. Abiteboul and V. Vianu: Queries and Computation on the Web, Proceedings of the 6th International Conference on Database Theory (ICDT97), pp. 662-675, Delphi, Greece, 1997.

[5] S. Abiteboul, V. Vianu and L. Segoufin: Representing and Querying XML with Incomplete Information PODS 2001, California USA, May 2001.

[6] R. Backofen: Regular Path Expressions in Feature Logic, Journal of Symbolic Computation, 17:421-455, 1994.

[7] C. Beeri and Y. Kornatski: A Logic Query Language for Hypermedia Systems, Information Systems Journal, Vol(77),1994.

[8] G. Blake, M. Consens, P. Kilpeläinen, P. Larson, T. Snider and F. Tompa: Text/relational Database Management Systems: Harmonizing SQL and SGML, Proceedings of the First International Conference on Applications of Databases, pp. 267-280, Vadstena, Sweden, 1994.

[9] T. Bray, J. Paoli, C.M. Sperberg-McQueen and E. Maler: Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation. Technical Report, http://www.w3.org/TR/REC-xml, October 6 2000.

[10] Peter Buneman and Susan Davidson and Gerd Hillebrand and Dan Suciu: A Query Language and Optimization Techniques for Unstructured Data, Proceedings of the ACM SIGMOD International Conference (SIGMOD'96), pp. 505-516, Montreal, Canada, June 1996.

[11] P. Buneman: Semistructured Data, Proceedings of the ACM Symposium on Principles of Database Systems, Tuscon, AZ, USA, 1997.

[12] P. Buneman, S. Davidson, M. Fernandez and D. Suciu: Adding structure to unstructured data, Proceedings of 6th International Conference on Database Theory (ICDT '97), pp. 336-350, 1997.

[13] P. Buneman, W. Fan and S. Weinstein: Path Constraints on Semistructured and Structured Data, Proceedings of the seventeenth ACM-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'98), ACM Press, pp. 129-138, Seattle, Washington, 1998.

[14] D. Calvanese, G. De Giacomo and M. Lenzerini: Representing and Reasoning on XML Documents: A Description Logic Approach, Journal of Logic and Computation, 9(3), 295-318, 1999.

[15] D. Calvanese, G. De Giacomo and M. Lenzerini: Semi-structured Data with Constraints and Incomplete Information, In Proc. of the 1998 Description Logic Workshop (DL'98), pp 11-20. CEUR Electronic Workshop Proceedings.

[16] R. Cattell: The Object Database Standard: ODMG-93, Morgan Kaufmann San Mateo, CA, 1994.

[17] V. Christophides, S. Abiteboul, S. Cluet and M. Scholl: From Structured Documents to Novel Query Facilities, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'94), pp. 313-324, May 1994.

[18] V. Christophides, S. Cluet and G. Moerkotte: Evaluating Queries with Generalized Path Expressions, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pp. 313-322, June 1996.

[19] E.F. Codd: Extending the Database Relational Model to Capture more Meaning, ACM Transactions on Database Systems, 4:397-434, 1979.

[20] M. P. Consens and A. O. Mendelzon: Expressing Structural Hypertext Queries in Graphlog, Proceedings of the Second ACM Conference on Hypertext, Pittsburgh, Pennsylvania, November 1999.

[21] M. Fernandez, D. Florescu, A. Levy and D. Suciu: A Query Language for a Web Site Management System, SIGMOD Record, 26(3):4-11, 1997.

[22] C. F. Goldfarb and Y. Rubinski: The SGML Handbook, Clarendon Press, Oxford, UK, 1990.

[23] M-S. Hacid, E. Terzi and A. Vakali: Querying XML with Constraints, Proc. of the Second International Conference on Internet Computing (IC'01), Las Vegas, USA, June 2001.

[24] H. Aït-Kaci, A. Podelski and G. Smolka: A Featured-Based Constraint System for Logic Programming with Entailment. Theoretical Computer Science, 122:263-283, 1994.

[25] Y. Kanza, W. Nutt and Y. Sagiv: Queries with Incomplete Answers over Semistructured Data Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Databasests (PODS'99), ACM Press, pp. 227-236, Philadelphia, Pennsylvania, May 1999.

Page 21: Modeling and Querying Web Datakolari1/Mining/papers/vakali...data, such as type assignment or query processing. Representation and querying of XML with incomplete information is studied

[26] R. M. Kaplan and J. Bresnan: Lexical-Functional Grammar: A Formal System for Grammatical Representation, in J. Bresnan editor, The mental Representation of Grammatical Relations, pp. 173-381, MIT Press, Cambridge (MA), 1982.

[27] W. Kim: On Object Oriented Database Technology. UniSQL Product Literature, 1994. [28] D. Konopnicki and O. Shmueli: W3QS: A Query System for the World-Wide Web,

Proceedings of the 21th International Conference on Very Large Databases (VLDB'95), pp. 54-65, Morgan Kaufman, Zurich, Switzerland, September 1995.

[29] G. M. Kuper, J. Sim : Subsumption for XML types, ICDT 2001 pp.331-345. [30] L. V. S. Lakshmanan, F. Sadri and I. N. Subramanian: A Declarative Language for Querying

and Restructuring the Web, Proceedings of the Sixth International Workshop on Research Issues in Data Engineering (RIDE'96), February 1996.

[31] J. W. Lioyd: Foundations of Logic Programming, Springer Verlang, Second Edition, 1987. [32] J. McHugh, S. Abiteboul, R. Goldman, D. Quass and J. Widom: LORE: A Database

Management System for Semistructured Data, SIGMOD Record, 26(3):54-66, 1997. [33] A. O. Mendelzon, G. A. Mihaila and T. Milo: Querying the World Wide Web, International

Journal on Digital Libraries, No. 1, Vol. (1), pp.54-67, 1996. [34] A. O. Mendelzon and T. Milo: Formal Models of Web Queries, In Proceedings of the 6th

ACM-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'97), pp. 134-143, Tuscon, Arizona, ACM Press 1997.

[35] T. Minohara and R. Watanabe: Queries on Structure in Hypertext, Foundations of Data Organization and Algorithms (FODO'93), pp. 394-411,1993.

[36] A. Motro: Extending the Relational Database Model to Support Goal Queris, Expert Database Systems, Larry Kerschberg Editor, Copyright 1987, by The Benjamin/Cummings Publishing Company, Inc.

[37] M. Müller, J. Niehren and A. Podelski: Ordering Constraints over Feature Trees, In Gert Smolka, editor, Principles and Practice of Constraint Programming-CP97, Third International Conference (CP'97), Linz, Austria, LNCS 1330, pp. 297-311. Springer Verlag, 1997.

[38] A. Rafii and R. Ahmed and M. Ketabchi and P. DeSmedt and W. Du: Integrating Strategies in the Pegasus Object Oriented Multidatabase System, Proceedings of the Twenty-Fifth Hawaii International Conference on System Sciences, Vol (II), pp. 323-334, 1992.

[39] R. Rao and B. Janssen and A. Rajaraman: GAIA Technical Overview, Technical Report, Xerox Palo Alto Research Center, 1994.

[40] J.A. Robinson: A Machine Oriented Logic Based on the Resolution Principle, Journal of the ACM, 12(1), 1965.

[41] International Organization for Standarization. ISO-8879: Information Processing - Text and office systems - Standard Generalized Markup Language (SGML), October 1986.

[42] T. Schlieder and F. Naumann: Approximate tree embedding for querying XML data. In ACM SIGIR Workshop On XML and Information Retrieval, Athens, Greece, July 2000.

[43] R. L. Schwartz: A Nutshell Handbook: Learning Perl, O'Reilly \& Associates, Inc., 1993. [44] S.M. Shieber: An Introduction to Unification-Based Approaches to Grammar. Vol. (4) of

CSLI Lecture Notes, Center for the Study of Language and Information, Stanford University, 1986.

[45] G. Smolka and R. Treinen: Records for Logic Programming, Journal of Logic Programming, 18(3): 229-258, April 1994. J. Widom, Y. Papakonstantinou and H. Garcia Molina: Object Exchange Across Heterogeneous Information Sources, Proceedings of the 11th International Conference on Data Engineering (ICDE'95), pp. 251-260, Taipei, Taiwan, 1995.

[46] J. Widom, Y. Papakonstantinou and H. Garcia Molina: Object Exchange Across Heterogeneous Information Sources, Proceedings of the 11th International Conference on Data Engineering (ICDE'95), pp. 251-260, Taipei, Taiwan, 1995.