Page 1: Integrity constraint integration in heterogeneous databases: An enhanced methodology for schema integration

Information Systems Vol. 22, No. 8, pp. 423-446, 1997
© 1997 Elsevier Science Ltd. All rights reserved

Pergamon

PII: S0306-4379(97)00027-6 Printed in Great Britain

0306-4379/97 $17.00 + 0.00

INTEGRITY CONSTRAINT INTEGRATION IN HETEROGENEOUS DATABASES:

AN ENHANCED METHODOLOGY FOR SCHEMA INTEGRATION†

VENKATAR RAMESH¹ and SUDHA RAM²

¹Department of Accounting and Information Systems, Indiana University, Bloomington, IN 47405

²Department of Management Information Systems, University of Arizona, Tucson, AZ 85721

(Received 18 July 1995; in final revised form 25 September 1997)

Abstract - In today’s technologically diverse corporate environment, it is common to find several different databases being used to accomplish the organization’s operational data management functions. Providing interoperability among these databases is important to the successful operation of the organization. One approach to providing interoperability among heterogeneous database systems is to define one or more schemas which represent a coherent view of the underlying databases. In the past, most approaches have used schematic knowledge about the underlying databases to generate integrated representations of the databases. In this paper we present a seven-step methodology for utilizing integrity constraint knowledge from heterogeneous databases. Specifically, we describe how we can generate a set of integrity constraints applicable at the integrated level from constraints specified on local databases. We introduce the concept of constraint-based relationships between objects in heterogeneous databases and describe the role that these relationships play in integrity constraint integration. Finally, we describe how the integrated set of constraints generated using our methodology can be used to facilitate semantic query processing in a heterogeneous database environment. © 1997 Elsevier Science Ltd. All rights reserved

Key words: Integrity Constraints, Heterogeneous Database Integration, Semantic Query Processing, Schema Integration

1. INTRODUCTION

Databases have been an integral part of most organizations’ computing infrastructure for the past two decades. It is likely that these databases have been developed over a period of time, each database being developed to meet the existing needs of one or more units within the organization. Thus, these databases are likely to be heterogeneous in that they may have been implemented using different data models, database technologies and hardware platforms. It is also common to find that many of these databases contain overlapping data. It is becoming increasingly important to develop mechanisms that will allow interoperability among these databases to support the day-to-day data management functions within modern organizations. One approach to providing interoperability among heterogeneous databases is to define one or more schemas that represent a coherent view of the underlying databases. The process of generating these schemas is known as schema integration.

Researchers have been studying issues in schema integration for over a decade. Batini et al. [2] summarized the characteristics of early schema integration methodologies and noted that the methodologies a) did not provide algorithmic specifications of the steps in integration and b) placed little emphasis on automation of the various steps in their methodologies. Since then, several researchers have reported on methodologies for schema integration that use a richer data model and provide algorithmic specifications for some portions of the schema integration process [7, 23, 12, 22, 9, 15]. All of these approaches to schema integration use the semantics ascertained from schematic knowledge about the underlying databases to facilitate the schema integration process. Sheth and Gala [21] note that integrated representations generated using schematic knowledge alone may not reflect the real world state of the underlying database objects (in this paper we use objects to refer to entities and attributes) accurately. Schematic information, however, represents only one of the many forms of knowledge available about the semantics of a database. In particular, integrity constraints defined on a database are an important source of knowledge describing the semantics of the underlying databases [8]. Since the objective of schema integration is to develop an integrated representation that accurately reflects the semantics of the databases being integrated, we contend that a schematic representation augmented with a set of integrity constraints would reflect the real world state of the underlying databases more accurately. Hence, it is necessary to integrate not only the underlying schemas but also the integrity constraints specified against the databases that are modeled by the schemas. In this paper, we describe how the schema integration process in a heterogeneous database environment can be augmented to generate an integrated set of integrity constraints (from constraints specified on the local databases) applicable to each integrated schema.

† Recommended by Matthias Jarke

Another important issue in heterogeneous database interoperability is the development of efficient mechanisms for accessing data from the underlying databases. In a heterogeneous database environment it is important to use or develop query processing mechanisms that do not rely on the syntactic structure of the underlying databases to improve query efficiency, since such information may not always be available. Semantic Query Optimization (SQO) techniques utilize knowledge about the semantics of the data to transform user queries into one or more equivalent database queries that are more efficient than the original user query. Such techniques have been shown to make data access more efficient in relational and deductive databases [10, 20]. Integrity constraints are the primary source of semantic knowledge utilized by SQO mechanisms. Several authors have described approaches to SQO. King [10], Shenoy and Ozsoyoglu [20], Bertino and Musto [3] and Siegel et al. [18] describe approaches to SQO that dynamically transform each query issued to a database against the constraints specified on the databases. Chakravarthy et al. [4] and An and Henschen [1], on the other hand, describe approaches to SQO that are based on precompilation of constraints. These approaches reduce the constraints specified on the databases into a form that can readily be applied to an incoming query. All these approaches, however, deal with SQO in the context of a single database system. van Kuijk [24] and Pan et al. [14] appear to be the only discussions of semantic query processing (SQP)‡ in a multiple database context. van Kuijk [24] presents SQP in a distributed database context while Pan et al. [14] present their discussion in a multi-database context. However, the discussion presented in each of these papers is rudimentary at best and does not address any new issues that may arise from using semantic query processing techniques in a multiple database context. Reddy et al. [16] refer to the possibility of performing semantic query processing using an integrated set of integrity constraints. However, they do not describe any mechanisms for deriving the integrated set of integrity constraints. In this paper, we describe how the generation of an integrated set of integrity constraints enables Semantic Query Processing (SQP) in a heterogeneous database environment.
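The idea behind SQO can be illustrated with a small sketch. It is not the algorithm of any of the cited papers; it only shows how one implication constraint (the supertanker constraint used later in Example 1) lets a contradictory query be answered without accessing the database. The dictionary representation, the function name `contradicts` and the owner value `Niarchos` are assumptions made for illustration.

```python
# Hypothetical representation: an implication constraint is a pair
# (antecedent, consequent) of attribute -> value dictionaries.
# "if SHIPTYPE = supertanker then OWNER = Onassis"
CONSTRAINT = ({"SHIPTYPE": "supertanker"}, {"OWNER": "Onassis"})


def contradicts(query, constraint):
    """Return True if the equality conditions in `query` can never be
    satisfied in a database that obeys `constraint`."""
    antecedent, consequent = constraint
    # The constraint only applies if the query fixes every antecedent value.
    applies = all(query.get(attr) == val for attr, val in antecedent.items())
    if not applies:
        return False
    # A conflict exists if the query demands a different consequent value.
    return any(attr in query and query[attr] != val
               for attr, val in consequent.items())


# Illustrative queries (equality selections only).
empty_query = {"SHIPTYPE": "supertanker", "OWNER": "Niarchos"}
normal_query = {"SHIPTYPE": "supertanker"}

print(contradicts(empty_query, CONSTRAINT))   # True: answer is empty
print(contradicts(normal_query, CONSTRAINT))  # False: query must be executed
```

A real optimizer would also handle range predicates and would generate alternative equivalent queries; the sketch covers only contradiction detection on equality conditions.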

The rest of this paper is organized as follows. The next section provides a framework for the use of integrity constraints in a heterogeneous environment. Section 3 presents a description of the steps that facilitate integrity constraint integration. Section 4 describes how these integrity constraints can be used to facilitate semantic query processing in a heterogeneous environment. A summary of results from a simulation study is also presented in this section. Finally, we present some conclusions and directions for future research.

2. FRAMEWORK FOR HETEROGENEOUS DATABASE INTEGRATION

Figure 1 presents an enhanced framework for schema integration that utilizes integrity constraints in a heterogeneous database environment. The enhanced methodology consists of the seven steps described in detail below. Steps that it shares with the traditional schema integration methodology are identified as such in the description.

‡ Semantic Query Processing techniques attempt to transform the queries issued on a database using the constraints specified on it. This may result in multiple such queries being generated. Semantic Query Optimization techniques, additionally, attempt to generate a rank-ordering of these alternative queries using appropriate heuristics.

Fig. 1: Enhanced Methodology for Schema Integration (diagram not reproduced; its labeled components are the Schematic Interschema Relationship Generator, the Constraint Evaluator, the “Real World” Interschema Relationship Generator, the Integrated Schema Generator, the Integrator and the Query Transformer, which produce schematic, constraint-based and “real world” interschema relationships, the integrated schema, the integrated integrity constraints and semantically equivalent queries)

1) Schema Translation: This phase is common to both methodologies and involves translating the local schemas into schemas using a semantic model.

2) Schematic Interschema Relationship Generation: In this phase, which is also common to both methodologies, the schematic properties of the schemas being integrated are analyzed to generate schematic interschema relationships among objects. However, unlike the traditional methodology, user confirmation of interschema relationships is not sought after this phase. Instead, confirmation of these relationships is sought by analyzing the databases using another source of knowledge, namely integrity constraints.

3) Constraint-Based Interschema Relationship Generation: This phase is unique to the enhanced methodology. Integrity constraints specified on the underlying databases are analyzed to generate interschema relationships among objects. These relationships are generated based on the degree to which integrity constraints involving an object a in database $D_1$ are valid (after appropriate transformation) in database $D_2$ and constraints involving an object b in database $D_2$ are valid in database $D_1$, where a and b are schematically related objects.

4) “Real World” Interschema Relationship Generation: This phase is unique to the enhanced methodology. In this phase, the schematic and constraint-based relationships generated in steps 2) and 3) are utilized to arrive at “real world” relationships among objects. These are interschema relationships among objects that are more representative of the actual relationships (based on their real-world states) among the database objects. The use of schematic and constraint-based relationships should enable us to arrive at a better set of interschema relationships among database objects compared to using schematic knowledge alone, since we are using two sources of knowledge to arrive at these relationships. This should also enable us to reduce the burden of interschema relationship identification that is put on the user by most schema integration methodologies.

5) Integrated Schema Generation: This phase is common to the traditional and the enhanced methodology. The primary difference is in the inputs to the integrated schema generation process. In the enhanced methodology, the “real world” relationships generated in the previous step, rather than schematic relationships, are used as inputs to the schema integration algorithms.

6) Integrity Constraint Integration: This phase and the next are unique to the enhanced methodology. This phase utilizes the constraint-based and “real world” relationships generated in steps 3) and 4), the integrated schema generated in step 5) as well as the knowledge of the integration strategies used in step 5), to generate a set of integrated constraints applicable to the integrated schema. Performing this step enables us to generate a more comprehensive representation of the underlying databases.

7) Semantic Query Processing: In addition to generating a more comprehensive representation, the presence of integrated integrity constraints provides us with an opportunity to use the semantic query processing (SQP) techniques elaborated on in King (1981) and Chakravarthy et al. (1990) in a heterogeneous database context. The use of these techniques would allow us to transform queries formulated against the integrated schema into equivalent, more efficient local database queries.

In the following section, we elaborate on the details of the constraint-based relationship generation and integrity constraint integration steps of the methodology detailed above. We illustrate the use of integrated integrity constraints in semantic query processing in Section 4.

3. A METHODOLOGY FOR INTEGRATING INTEGRITY CONSTRAINTS

3.1 Assumptions

Generating an integrated set of integrity constraints is a two-phased process: constraint-based relationship generation and integrity constraint integration. For clarity, we will present our discussion in the context of integrating two underlying databases.

It is assumed that the schema translation step generates schemas represented using a semantic model such as the Entity-Relationship Model [5]. Integrity constraints are represented using a first-order logic representation. Most of the examples in the next few sections use one type of integrity constraint (implication constraints) to illustrate concepts, because such constraints are most useful in semantic query processing. Our methodology, however, encompasses any constraint that can be represented as a Horn clause.

3.2 Constraint-Based Relationship Generation

Generating the unified set of constraints is not a trivial process. One cannot simply transform the integrity constraints (involving a particular predicate) into their equivalent form in the integrated schema. This is because such a transformation implies that the new constraint applies to both the underlying databases, which may not be the case. Careless transformation can result in an incorrect set of constraints being generated.

The problem with the above approach is that it does not determine the applicability of constraints specified in one database to the other and vice versa. To solve this problem we introduce the concept of constraint-based relationships among objects in heterogeneous schemas. Constraint-based relationships represent interschema relationships among database objects generated by analyzing the characteristics of constraints specified on the databases.

Consider an integrity constraint of the form:

$S_1 \leftarrow R_1$

where $S_1$ utilizes an attribute $x_1$ and $R_1$ utilizes an attribute $y_1$. Such a constraint describes restrictions placed by certain values of $y_1$ on the values that $x_1$ takes. Hence, the integrity constraint can be thought of as describing a property of the attribute $y_1$ as well as describing characteristics of the entity class E to which $y_1$ belongs. If the same constraint is valid in another database for an attribute $y_1'$ belonging to an entity class E’, then it is likely that $y_1$ and $y_1'$ have some relationship, as do E and E’. This is the premise underlying constraint-based relationships among objects (entities and attributes) in heterogeneous databases.

Notations

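a) In the rest of this paper, A and B are objects (entities or attributes) in the databases $D_1$ and $D_2$, respectively, between which we are trying to determine constraint-based relationships. The sets IC-A and IC-B represent the sets of integrity constraints involving the respective object, i.e., constraints involving the predicate. IC’-A, IC”-A, IC’-B and IC”-B are non-overlapping, non-empty subsets of IC-A and IC-B, i.e.,

i) $\text{IC-A} = \text{IC}'\text{-A} \cup \text{IC}''\text{-A}$, $\text{IC-B} = \text{IC}'\text{-B} \cup \text{IC}''\text{-B}$

ii) $\text{IC}'\text{-A} \cap \text{IC}''\text{-A} = \emptyset$, $\text{IC}'\text{-B} \cap \text{IC}''\text{-B} = \emptyset$

iii) $\text{IC}'\text{-A} \neq \emptyset$, $\text{IC}''\text{-A} \neq \emptyset$, $\text{IC}'\text{-B} \neq \emptyset$, $\text{IC}''\text{-B} \neq \emptyset$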

b) Every constraint specified against a database is considered to involve one or more database objects (entities or attributes). Accordingly, we associate a constraint with the object it involves. We define what it means for a constraint to involve an object as follows. Consider an integrity constraint of the form:

$S \leftarrow R_1, R_2, R_3, \ldots, R_n$

where S defines a restriction on an attribute in $R_i$ ($i = 1, \ldots, n$) and the $R_i$ ($i = 1, \ldots, n$) are relations. For each $R_i$, there may be one or more attributes $A_i$ with value restrictions. Every relation $R_i$ with such an attribute $A_i$ is said to be involved in the constraint. If the attributes $A_i$ involved in the constraint belong to different relations, then the constraint becomes a part of each of the corresponding sets, one belonging to each object $R_i$.

c) We define the operator $\vdash$ (adapted from [17]) as the operator that checks the validity of a constraint on a database. If $x \in \text{IC-A}$, then $x \vdash D_2$ is true if the result of executing the query corresponding to the constraint x on $D_2$ does not violate the constraint.

Definition 1 A CBequiv B if the constraints involving the objects can be placed into two sets IC-A and IC-B, such that

$\forall x,\ x \in \text{IC-A},\ x \vdash D_2$

$\forall y,\ y \in \text{IC-B},\ y \vdash D_1$

The definition states that A CBequiv B if every constraint involving object A in database $D_1$ is valid in database $D_2$ and every constraint involving object B in database $D_2$ is valid in database $D_1$.

Definition 2 A CBsubsumes B if the constraints involving the objects can be placed into three sets, IC-A, IC’-B and IC”-B such that

$\forall x,\ x \in \text{IC-A},\ x \vdash D_2$

$\forall y,\ y \in \text{IC}'\text{-B},\ y \vdash D_1$

$\forall z,\ z \in \text{IC}''\text{-B},\ \neg(z \vdash D_1)$

In other words, every constraint involving object A in database $D_1$ is valid in database $D_2$ but there is at least one constraint involving object B in database $D_2$ that is not valid in database $D_1$. Hence, the set of constraints involving object B can be divided into two non-empty subsets, IC’-B and IC”-B, where IC’-B contains the constraints that are valid in database $D_1$ and IC”-B contains the constraints that are not valid in database $D_1$.

Definition 3 A CBoverlap B if the constraints involving the objects can be placed into four sets, IC’-A, IC”-A, IC’-B and IC”-B, such that

$\forall w,\ w \in \text{IC}'\text{-A},\ w \vdash D_2$

$\forall x,\ x \in \text{IC}''\text{-A},\ \neg(x \vdash D_2)$

$\forall y,\ y \in \text{IC}'\text{-B},\ y \vdash D_1$

$\forall z,\ z \in \text{IC}''\text{-B},\ \neg(z \vdash D_1)$

In other words, there are some constraints involving object A in database $D_1$ that are not valid in database $D_2$ and there are some constraints involving object B in database $D_2$ that are not valid in database $D_1$. Thus, we can divide the set of constraints involving object A in database $D_1$ and object B in database $D_2$ into four non-overlapping, non-empty sets, IC’-A, IC’-B, IC”-A and IC”-B. IC’-A and IC’-B are the sets that consist of constraints involving an object that are valid in the other database, and IC”-A and IC”-B are the sets that consist of constraints involving an object in a database that are not valid in the other database.

Definition 4 A CBdisjoint B if the constraints involving the objects can be placed into two sets, IC-A and IC-B such that

$\forall x,\ x \in \text{IC-A},\ \neg(x \vdash D_2)$

$\forall y,\ y \in \text{IC-B},\ \neg(y \vdash D_1)$

In other words, no constraint involving object A in database $D_1$ is valid in database $D_2$ and no constraint involving object B in database $D_2$ is valid in database $D_1$.

The definitions presented above are mutually exclusive, i.e., only one type of relationship can exist between any pair of objects. The above relationships are also exhaustive, i.e., no other type of constraint-based relationship can exist between two objects.
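The four definitions can be read as a decision procedure over the outcomes of the validity checks. The following is a minimal sketch under stated assumptions: each object has at least one constraint, the validity of every constraint in the other database has already been determined, and borderline cases are assigned according to the prose readings of Definitions 2 and 3 ("at least one constraint is not valid"). The function name `classify_cb` and the boolean-list input are illustrative, not part of the paper.

```python
def classify_cb(a_valid_in_d2, b_valid_in_d1):
    """Classify the constraint-based relationship between objects A and B.

    a_valid_in_d2: list of booleans, one per constraint in IC-A, True when
                   that constraint is valid in D2.
    b_valid_in_d1: list of booleans, one per constraint in IC-B, True when
                   that constraint is valid in D1.
    """
    if not a_valid_in_d2 or not b_valid_in_d1:
        raise ValueError("both objects need at least one constraint")

    all_a, any_a = all(a_valid_in_d2), any(a_valid_in_d2)
    all_b, any_b = all(b_valid_in_d1), any(b_valid_in_d1)

    if all_a and all_b:              # Definition 1
        return "A CBequiv B"
    if not any_a and not any_b:      # Definition 4
        return "A CBdisjoint B"
    if all_a:                        # Definition 2: some constraint of B fails in D1
        return "A CBsubsumes B"
    if all_b:                        # Definition 2 with the roles of A and B swapped
        return "B CBsubsumes A"
    return "A CBoverlap B"           # Definition 3


print(classify_cb([True, True], [True, False]))   # A CBsubsumes B
print(classify_cb([True, False], [True, False]))  # A CBoverlap B
```

Because the branches are tested in sequence and every input reaches exactly one of them, the sketch returns a single relationship for any pair of non-empty result lists.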

3.2.1 Evaluation of Constraint-Based Relationships

Generating constraint-based relationships means that we need to implement the operator $\vdash$ that evaluates the validity of a constraint on a database. Evaluating the validity of a constraint specified on a database $D_1$ in a database $D_2$ requires 1) transforming the constraint (on $D_1$) into a valid query on database $D_2$ and 2) evaluating the results to test the validity of the constraint.

Performing the first step in the context of a heterogeneous database presents a problem because the predicate and attribute names (in $D_2$) that correspond to the ones specified in the integrity constraint (in $D_1$) are not known, and vice versa. In addition, trying to determine the validity of every single integrity constraint will take a significant amount of time. Hence, certain heuristics are needed to limit the object pairs that will be subject to the constraint-based relationship generation process.

Fortunately, both these problems can be solved by using knowledge about schematic relationships (prefixed by SCH) generated using the heuristics described in [15]. We use this schematic knowledge as the starting point for constraint-based relationship (prefixed by CB) generation. Objects that are schematically disjoint are not evaluated for constraint-based relationships. For each pair of objects that is schematically related, the integrity constraints involving object A in database $D_1$ are transformed into a legal query on database $D_2$ and vice versa. The results of these queries are then evaluated and CB relationships between the objects are generated based on the definitions presented above.
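The two uses of schematic knowledge described above (pruning schematically disjoint pairs and renaming attributes before a constraint is evaluated on the other database) can be sketched as follows. The `SCH` dictionary, the functions `candidate_pairs` and `translate`, and the `Ports` entity are hypothetical and serve only as illustration; the Ships and Boats names anticipate Example 1.

```python
# Hypothetical schematic relationships between objects of D1 and D2.
SCH = {
    ("Ships", "Boats"): "SCHequiv",
    ("Ships.OWNER", "Boats.OWNER"): "SCHequiv",
    ("Ships.ORIGIN", "Boats.COUNTRY"): "SCHequiv",
    ("Ships.SHIPTYPE", "Boats.TYPE"): "SCHequiv",
    ("Ships", "Ports"): "SCHdisjoint",   # invented pair, pruned below
}


def candidate_pairs(sch):
    """Keep only the object pairs that are schematically related."""
    return [pair for pair, rel in sch.items() if rel != "SCHdisjoint"]


def translate(conditions, source, target, sch):
    """Rename the attributes of relation `source` used in `conditions`
    to the schematically related attributes of relation `target`.
    Raises KeyError if an attribute has no schematic counterpart."""
    mapping = {}
    for (a, b), rel in sch.items():
        if rel == "SCHdisjoint" or "." not in a:
            continue  # skip disjoint pairs and entity-level pairs
        a_rel, a_attr = a.split(".")
        b_rel, b_attr = b.split(".")
        if (a_rel, b_rel) == (source, target):
            mapping[a_attr] = b_attr
        elif (b_rel, a_rel) == (source, target):
            mapping[b_attr] = a_attr
    return {mapping[attr]: val for attr, val in conditions.items()}


print(translate({"SHIPTYPE": "supertanker"}, "Ships", "Boats", SCH))
print(translate({"COUNTRY": "ICELAND"}, "Boats", "Ships", SCH))
```

The translated conditions are what a query against the other database would select on; evaluating them is the second step of the validity check.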

Example 1 Integrity Constraint Evaluation

Schema A: Ships(x1, OWNER, SHIPTYPE, ORIGIN, x5, x6, x7)

IC-A: OWNER = ‘Onassis’ $\leftarrow$ Ships(x1, OWNER, supertanker, ORIGIN, x5, x6, x7) --- (a)
(if Shiptype = ‘Supertanker’ then Owner = ‘Onassis’)

Schema B: Boats(x1, COUNTRY, OWNER, TYPE)

IC-B: $\neg$ Boats(x1, ICELAND, OWNER, TYPE) --- (b)
(there are no Boats with Country = ‘ICELAND’)

Predicates Ships and Boats are SCHequivalent
Ships.OWNER and Boats.OWNER are SCHequivalent
Ships.ORIGIN and Boats.COUNTRY are SCHequivalent
Ships.SHIPTYPE and Boats.TYPE are SCHequivalent □

Consider the schemas shown in Example 1. To determine the constraint-based relationships between the predicates Ships and Boats we would need to evaluate the validity of each of the constraints against the other database. Using the schematic interschema relationships available to us we would transform the integrity constraint in (a) into the query:

retrieve * from Boats where TYPE = ‘supertanker’

We could then say that (a) is valid in the Boats database if the tuples resulting from this query have Owner = ‘Onassis’. Similarly, we would transform the constraint (b) into the query

retrieve Ships from Ships where ORIGIN = ‘ICELAND’

and say that (b) is valid in the Ships database if no tuples are returned as a result of this query. If both constraints are valid then from Definition 1 we would be able to assert that Ships CBequiv Boats. Note that we can also assert that Ships.SHIPTYPE CBequiv Boats.TYPE and Boats.COUNTRY CBequiv Ships.ORIGIN.
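The evaluation of constraints (a) and (b) can be sketched over small in-memory tables. The tuples below are invented for illustration, only the attributes needed by the two constraints are shown, and the helper names (`retrieve`, `implication_holds`, `denial_holds`) are assumptions. Attribute names follow the schemas of Example 1, in which Boats uses TYPE where Ships uses SHIPTYPE.

```python
# Hypothetical contents of the Boats database and the Ships database.
BOATS = [
    {"COUNTRY": "GREECE", "OWNER": "Onassis", "TYPE": "supertanker"},
    {"COUNTRY": "NORWAY", "OWNER": "Hansen", "TYPE": "trawler"},
]
SHIPS = [
    {"OWNER": "Onassis", "SHIPTYPE": "supertanker", "ORIGIN": "GREECE"},
]


def retrieve(table, **where):
    """Return the tuples of `table` that match every equality condition."""
    return [t for t in table if all(t[k] == v for k, v in where.items())]


def implication_holds(table, where, then):
    """Constraint (a) style: every tuple matching `where` satisfies `then`.
    Holds vacuously when no tuple matches `where`."""
    return all(all(t[k] == v for k, v in then.items())
               for t in retrieve(table, **where))


def denial_holds(table, where):
    """Constraint (b) style: no tuple of `table` matches `where`."""
    return not retrieve(table, **where)


# (a), translated to Boats: TYPE = supertanker implies OWNER = Onassis.
a_valid_in_boats = implication_holds(BOATS, {"TYPE": "supertanker"},
                                     {"OWNER": "Onassis"})
# (b), translated to Ships: no ship has ORIGIN = ICELAND.
b_valid_in_ships = denial_holds(SHIPS, {"ORIGIN": "ICELAND"})

print(a_valid_in_boats and b_valid_in_ships)  # True for the data above
```

With this data both checks succeed, which corresponds to the situation in which Definition 1 lets us assert Ships CBequiv Boats.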

Once the constraint-based relationships have been generated, they can be used in conjunction with schematic interschema relationships to arrive at a set of interschema relationships among database objects that are closer to reality than either individual set of relationships taken alone [15]. We refer to these relationships as “real world” relationships (RWR). Table 1 summarizes the different “real-world” relationships that can be generated from constraint-based and schematic relationships among entity classes. For example, the table shows that, if it is true for two objects A and B that, A SCHequiv B and A CBequiv B then we would generate A RWequiv B. Table 1 has been generated with the goal of ensuring that the integrated schema generated can represent a complete set of integrity constraints, without violating the semantics of the underlying database. We make extensive use of these rules during the integrity constraint integration process (described below).

| Schematic Relationships \ Constraint-Based Relationships | A CBequiv B | A CBsubsume B | A CBoverlap B |
|---|---|---|---|
| A SCHequiv B | A RWequiv B | A RWsubsume B | A RWoverlap B |
| A SCHsubsume B | A RWsubsume B | A RWsubsume B | A RWoverlap B |
| A SCHoverlap B | A RWoverlap B | A RWoverlap B | A RWoverlap B |

Table 1: Rules for Generating “Real World” Interschema Relationships from Schematic and Constraint-Based Relationships
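The rules of Table 1 amount to a lookup keyed by the pair of schematic and constraint-based relationships. A minimal sketch follows; the dictionary name `RWR_RULES` and the function `real_world_relationship` are illustrative. Combinations that Table 1 does not list, such as those involving a disjoint relationship, are rejected rather than guessed.

```python
# Table 1 as a lookup: (schematic, constraint-based) -> "real world".
RWR_RULES = {
    ("SCHequiv", "CBequiv"): "RWequiv",
    ("SCHequiv", "CBsubsume"): "RWsubsume",
    ("SCHequiv", "CBoverlap"): "RWoverlap",
    ("SCHsubsume", "CBequiv"): "RWsubsume",
    ("SCHsubsume", "CBsubsume"): "RWsubsume",
    ("SCHsubsume", "CBoverlap"): "RWoverlap",
    ("SCHoverlap", "CBequiv"): "RWoverlap",
    ("SCHoverlap", "CBsubsume"): "RWoverlap",
    ("SCHoverlap", "CBoverlap"): "RWoverlap",
}


def real_world_relationship(schematic, constraint_based):
    """Return the "real world" relationship for a pair of objects, or
    raise ValueError for a combination that Table 1 does not cover."""
    try:
        return RWR_RULES[(schematic, constraint_based)]
    except KeyError:
        raise ValueError(
            f"Table 1 has no rule for {schematic} with {constraint_based}"
        ) from None


print(real_world_relationship("SCHequiv", "CBequiv"))     # RWequiv
print(real_world_relationship("SCHsubsume", "CBoverlap"))  # RWoverlap
```

In every cell the result is the weaker of the two input relationships, taking equiv as strongest and overlap as weakest.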

3.3 Integrity Constraint Integration

The objective of this phase is to generate a set of integrity constraints at the integrated schema level. However, generating the unified set of integrity constraints is not a trivial process. In particular, one cannot simply transform the integrity constraints (involving a particular predicate in a local schema) into their equivalent form in the integrated schema. This is because an integrity constraint that involves an entity at the integrated level implies that the constraint is valid in all databases which have entity classes that can be mapped to the integrated entity. Hence, such a transformation may result in erroneous information being supplied to the user.

Since every integrity constraint applicable to the integrated schemas will be associated with entity classes at the global level, generation of an integrated set of constraints is dependent on the schema integration strategy used to generate entities at the global schema level. We follow the guidelines presented in Larson et al. [12] for schema integration. For each integrity constraint integration rule, we show that the constraints generated by the application of the rules will not violate the semantics of the underlying databases.

Definition 5 We define the operator $\theta$, the transformation operator, as the operator which transforms a constraint on a local database to a constraint at the integrated level. Formally, if A is an entity in a local schema and A’ is an entity in the integrated schema and $x \in \text{IC-A}$, then $\theta_{A'}(x)$ generates an equivalent constraint $x' \in \text{IC-A}'$. The $\theta$ operator ensures that the new constraint utilizes the appropriate predicate and argument names.

Case 1: A RWequiv B

Table 1 suggests that a RWequiv relationship exists among objects that are schematically equivalent. Such a relationship is represented, schematically, by a single entity class AB’ in the integrated schema, which represents the semantics of both the underlying entity classes. Table 1 also indicates that the only type of constraint-based relationship that can result in the generation of the above real world relationship is a CBequiv relationship.

Rule 1 If there is a CBequiv relationship, then the individual integrity constraints can be transformed as follows:

$\forall x,\ x \in \text{IC-A},\ \theta_{AB'}(x)$

$\forall y,\ y \in \text{IC-B},\ \theta_{AB'}(y)$

where the $\theta$ operator transforms an integrity constraint specified against a local schema into an equivalent constraint expressed in terms of the integrated schema’s predicate AB’.

Proof: Since AB' represents an integrated representation of the local entity classes A and B, it needs to be shown that $\forall q$, $q \in$ IC-AB', $q \vdash D_1$ and $q \vdash D_2$.

i) We know that all constraints $q$, $q \in$ IC-AB' were generated by $\theta_{AB'}(x)$ or $\theta_{AB'}(y)$.

a) $x \in$ IC-A $\rightarrow x \vdash D_1$ (by definition) is true. Hence, $\forall q$, $q \in$ IC-AB' generated by $\theta_{AB'}(x)$, $q \vdash D_1$ is true. Also, A CBequiv B $\rightarrow x \vdash D_2$, $\forall x$, $x \in$ IC-A. Hence, $\forall q$, $q \in$ IC-AB' generated by $\theta_{AB'}(x)$, $q \vdash D_2$ is also true.

b) $y \in$ IC-B $\rightarrow y \vdash D_2$ (by definition) is true. Hence, $\forall q$, $q \in$ IC-AB' generated by $\theta_{AB'}(y)$, $q \vdash D_2$ is true. Also, A CBequiv B $\rightarrow y \vdash D_1$, $\forall y$, $y \in$ IC-B. Hence, $\forall q$, $q \in$ IC-AB' generated by $\theta_{AB'}(y)$, $q \vdash D_1$ is also true. $\Box$

Consider the databases in Example 1. Given these conditions, schema integration is likely to result in a predicate Ship_Boats(x1, OWNER, SHIPTYPE, ORIGIN, x5, x6, x7, x8), where x8 is an attribute corresponding to x1 in the predicate Boats. Based on Rule 1, we would generate the following integrity constraints at the integrated level:

OWNER = 'Onassis' ← Ship_Boats(x1, OWNER, supertanker, ORIGIN, x8)

← Ship_Boats(x1, OWNER, SHIPTYPE, ICELAND, x5, x6, x7, x8)

Note that without determining the existence of the CBequiv relationship we would not be able to generate the transformed integrity constraints since such an integrity constraint assumes that the constraints apply to both the underlying databases which may or may not be the case.

Case 2: A RWsubsumes B

Table 1 indicates that such a “real-world” relationship can exist among objects that are schematically equivalent or subsumed. Table 1 also shows that a RWsubsume relationship can exist when the underlying objects have a CBequiv or CBsubsume relationship. Hence, we present rules for integrity constraint integration for these two types of CB relationships.

A CBequiv B

A RWsubsumes relationship among two objects that have a subsumption relationship is represented in the integrated schema by two entity classes A’ and B’, such that B’ is a subclass of A’. Integrity constraints belonging to the original entity classes A and B are integrated as follows:

Rule 2: If A CBequiv B, then the individual integrity constraints involving A and B are transformed as follows:

i) $\forall w$, $w \in$ IC-A, $\theta_{A'}(w)$,

ii) $\forall x$, $x \in$ IC-B, $\theta_{A'}(x)$,

iii) $\forall y$, $y \in$ IC-B, $\theta_{B'}(y)$,

iv) $\forall z$, $z \in$ IC-A, $\theta_{B'}(z)$

Proof: Since A' and B' represent the integrated schema representations of the underlying entity classes A and B, it needs to be shown that i) $\forall r$, $r \in$ IC-A', $r \vdash D_1$ and ii) $\forall s$, $s \in$ IC-B', $s \vdash D_2$.

i) We know that all constraints $r$, $r \in$ IC-A' were generated by $\theta_{A'}(w)$ or $\theta_{A'}(x)$.

a) $w \in$ IC-A $\rightarrow w \vdash D_1$ (by definition) is true. Hence, $\forall r$, $r \in$ IC-A' generated by $\theta_{A'}(w)$, $r \vdash D_1$ is true.

b) A CBequiv B $\rightarrow x \vdash D_1$ is true $\forall x$, $x \in$ IC-B. Hence, $\forall r$, $r \in$ IC-A' generated by $\theta_{A'}(x)$, $r \vdash D_1$ is also true.

ii) We know that all constraints $s$, $s \in$ IC-B' were generated by $\theta_{B'}(y)$ or $\theta_{B'}(z)$.

a) $y \in$ IC-B $\rightarrow y \vdash D_2$ (by definition) is true. Hence, $\forall s$, $s \in$ IC-B' generated by $\theta_{B'}(y)$, $s \vdash D_2$ is true.

b) A CBequiv B $\rightarrow z \vdash D_2$ is true $\forall z$, $z \in$ IC-A. Hence, $\forall s$, $s \in$ IC-B' generated by $\theta_{B'}(z)$, $s \vdash D_2$ is also true. $\Box$


Informally, we transform all integrity constraints involving the predicates A and B into equivalent integrity constraints using the predicates A' and B' respectively. In addition, B' inherits all of the constraints of A'.

A CBsubsume B

From Table 1, we can see that when A CBsubsumes B, an A RWsubsumes B relationship can be the result of an A SCHequiv B or A SCHsubsume B relationship.

A RWsubsumes relationship among two objects that are related using schematic equivalence or subsumption is represented by two entity classes A’ and B’, where A’ is a superclass of B’. All the attributes belonging to A and B are transformed into attributes in A’ and are subsequently inherited by B’. Integrity constraint integration is performed according to the following rule:

Rule 3: If A CBsubsume B, then the individual integrity constraints are transformed as follows:

i) $\forall w$, $w \in$ IC-A, $\theta_{A'}(w)$,

ii) $\forall x$, $x \in$ IC'-B, $\theta_{A'}(x)$,

iii) $\forall y$, $y \in$ IC-B, $\theta_{B'}(y)$,

iv) $\forall z$, $z \in$ IC-A, $\theta_{B'}(z)$

Proof: Since A' and B' represent the integrated schema representations of the underlying entity classes A and B, it needs to be shown that: i) $\forall r$, $r \in$ IC-A', $r \vdash D_1$ and ii) $\forall s$, $s \in$ IC-B', $s \vdash D_2$.

i) We know that all constraints $r$, $r \in$ IC-A' were generated by $\theta_{A'}(w)$ or $\theta_{A'}(x)$.

a) $w \in$ IC-A $\rightarrow w \vdash D_1$ (by definition) is true. Hence, $\forall r$, $r \in$ IC-A' generated by $\theta_{A'}(w)$, $r \vdash D_1$ is true.

b) A CBsubsume B $\rightarrow x \vdash D_1$ is true $\forall x$, $x \in$ IC'-B. Hence, $\forall r$, $r \in$ IC-A' generated by $\theta_{A'}(x)$, $r \vdash D_1$ is also true.

ii) We know that all constraints $s$, $s \in$ IC-B' were generated by $\theta_{B'}(y)$ or $\theta_{B'}(z)$.

a) $y \in$ IC-B $\rightarrow y \vdash D_2$ (by definition) is true. Hence, $\forall s$, $s \in$ IC-B' generated by $\theta_{B'}(y)$, $s \vdash D_2$ is true.

b) A CBsubsume B $\rightarrow z \vdash D_2$ is true $\forall z$, $z \in$ IC-A. Hence, $\forall s$, $s \in$ IC-B' generated by $\theta_{B'}(z)$, $s \vdash D_2$ is also true. $\Box$

Informally, the first item transforms all constraints involving A into constraints involving A'. The second item transforms the subset of constraints involving B that are applicable to A into constraints on A'. The third item transforms all of B's constraints into constraints on B'. The final item shows that all of A's constraints are inherited by B'.
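The four items of the subsumption rules can be sketched as a small function. This is only an illustration under simplifying assumptions: constraints are opaque strings, and the transformation operator is modelled as tagging a constraint with its target class. The function name and the example constraint labels are hypothetical.

```python
def integrate_subsumption(ic_a, ic_b, ic_b_valid_in_d1):
    """Sketch of the subsumption rules: A' is the superclass, B' the subclass.

    ic_b_valid_in_d1 plays the role of IC'-B, the constraints on B that also
    hold in the database containing A. Passing ic_b itself corresponds to the
    CBequiv case, where all of B's constraints hold in that database.
    """
    # Constraints on A go to A'.
    ic_a_prime = [("A'", c) for c in ic_a]
    # Constraints on B that are applicable to A also go to A'.
    ic_a_prime += [("A'", c) for c in ic_b_valid_in_d1]
    # Constraints on B go to B'.
    ic_b_prime = [("B'", c) for c in ic_b]
    # B' inherits the constraints of A.
    ic_b_prime += [("B'", c) for c in ic_a]
    return ic_a_prime, ic_b_prime


# Hypothetical constraint labels used only for illustration.
a_prime, b_prime = integrate_subsumption(["a1", "a2"], ["b1", "b2"], ["b1"])
```

With these inputs, A' receives three constraints (a1, a2, b1) and B' receives four (b1, b2, a1, a2).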

Example 2 Sample Schema

Ships(x1, OWNER, SHIPTYPE, ORIGIN, WT, x6, x7)

Oil_Tankers(x1, COUNTRY, OWNER, TYPE, DEADWT)

with the equivalences:

Ships RWsubsumes Oil_Tankers

Ships.SHIPTYPE RWequiv to Oil_Tankers.TYPE

Ships.OWNER RWequiv to Oil_Tankers.OWNER

Ships.ORIGIN RWequiv to Oil_Tankers.COUNTRY

Ships.DEADWT RWequiv to Oil_Tankers.DEADWT

IC-A:

OWNER = 'Onassis' ← Ships(x1, OWNER, supertanker, ORIGIN, WT, x6, x7)

SHIPTYPE = 'supertanker' ← Ships(x1, OWNER, SHIPTYPE, ORIGIN, WT, x6, x7), WT > 200

IC-B:

← Oil_Tankers(x1, ICELAND, OWNER, TYPE, DEADWT)

TYPE = 'pressurized tanker' ← Oil_Tankers(x1, uae, OWNER, TYPE, DEADWT)

Integrated Schema:

Int_Ships(x1, OWNER, SHIPTYPE, COUNTRY, WT, x6, x7)

Int_Oil_Tankers(x1, COUNTRY, OWNER, TYPE, x5) $\Box$

Consider the schemas shown in Example 2. If Ships and Oil_Tankers had a CBsubsume relationship, the set of integrated constraints, applying Rule 3, would be

IC-Int_Ships:

OWNER = 'Onassis' ← Int_Ships(x1, OWNER, supertanker, ORIGIN, WT, x6, x7)

SHIPTYPE = 'supertanker' ← Int_Ships(x1, OWNER, SHIPTYPE, ORIGIN, WT, x6, x7), WT > 200

← Int_Ships(x1, OWNER, SHIPTYPE, iceland, WT, x6, x7)

SHIPTYPE = 'pressurized tanker' ← Int_Ships(x1, OWNER, SHIPTYPE, uae, WT, x6, x7)

IC-Int_Oil_Tankers:

OWNER = 'Onassis' ← Int_Oil_Tankers(x1, COUNTRY, OWNER, supertanker, DEADWT)

SHIPTYPE = 'supertanker' ← Int_Oil_Tankers(x1, COUNTRY, OWNER, supertanker, DEADWT), DEADWT > 200

← Int_Oil_Tankers(x1, ICELAND, OWNER, TYPE, DEADWT)

TYPE = 'pressurized tanker' ← Int_Oil_Tankers(x1, uae, OWNER, TYPE, DEADWT)

It should be noted that the process of integrity constraint integration has resulted in additional knowledge (in the form of constraints) being generated at the integrated level. The ability to generate such additional constraints results in a more complete definition of the integrated schema. It also means that there are more constraints associated with some of the entity classes at the integrated level. The use of these additional constraints for semantic query optimization can potentially result in substantial savings.

Case 3: A RWoverlap B

Schematic integration in this case results in the generation of a new entity class AB’ which has A’ and B’ as its two subclasses. AB’ can be thought of as representing the commonalities between A and B. Table 1 suggests that three possible CB relationships can exist: A CBequiv B, A CBsubsume B, and A CBoverlap B.

Rule 4: If A CBequiv B, then the individual integrity constraints involving A and B are transformed as follows:

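i) $\forall x$, $x \in$ IC-A, $\theta_{A'}(x)$,

ii) $\forall y$, $y \in$ IC-A, $\theta_{AB'}(y)$,

iii) $\forall z$, $z \in$ IC-A, $\theta_{B'}(z)$,

iv) $\forall u$, $u \in$ IC-B, $\theta_{B'}(u)$,

v) $\forall v$, $v \in$ IC-B, $\theta_{AB'}(v)$,

vi) $\forall w$, $w \in$ IC-B, $\theta_{A'}(w)$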

Proof: In this case, three things need to be shown: i) $\forall q$, $q \in$ IC-AB', $q \vdash D_1$ and $q \vdash D_2$, ii) $\forall r$, $r \in$ IC-A', $r \vdash D_1$ and iii) $\forall s$, $s \in$ IC-B', $s \vdash D_2$.

i) We know that all constraints $q$, $q \in$ IC-AB' were generated by $\theta_{AB'}(y)$ or $\theta_{AB'}(v)$.

a) $y \in$ IC-A $\rightarrow y \vdash D_1$ (by definition) is true. Hence, $\forall q$, $q \in$ IC-AB' generated by $\theta_{AB'}(y)$, $q \vdash D_1$ is true. Also, A CBequiv B $\rightarrow y \vdash D_2$, $\forall y$, $y \in$ IC-A. Hence, $\forall q$, $q \in$ IC-AB' generated by $\theta_{AB'}(y)$, $q \vdash D_2$ is also true.

b) $v \in$ IC-B $\rightarrow v \vdash D_2$ (by definition) is true. Hence, $\forall q$, $q \in$ IC-AB' generated by $\theta_{AB'}(v)$, $q \vdash D_2$ is true. Also, A CBequiv B $\rightarrow v \vdash D_1$, $\forall v$, $v \in$ IC-B. Hence, $\forall q$, $q \in$ IC-AB' generated by $\theta_{AB'}(v)$, $q \vdash D_1$ is also true.

ii) We know that all constraints $r$, $r \in$ IC-A' were generated by $\theta_{A'}(x)$ or $\theta_{A'}(w)$.

a) $x \in$ IC-A $\rightarrow x \vdash D_1$ (by definition) is true. Hence, $\forall r$, $r \in$ IC-A' generated by $\theta_{A'}(x)$, $r \vdash D_1$ is true.

b) A CBequiv B $\rightarrow w \vdash D_1$ is true $\forall w$, $w \in$ IC-B. Hence, $\forall r$, $r \in$ IC-A' generated by $\theta_{A'}(w)$, $r \vdash D_1$ is also true.

iii) We know that all constraints $s$, $s \in$ IC-B' were generated by $\theta_{B'}(u)$ or $\theta_{B'}(z)$.

a) $u \in$ IC-B $\rightarrow u \vdash D_2$ (by definition) is true. Hence, $\forall s$, $s \in$ IC-B' generated by $\theta_{B'}(u)$, $s \vdash D_2$ is true.

b) A CBequiv B $\rightarrow z \vdash D_2$ is true $\forall z$, $z \in$ IC-A. Hence, $\forall s$, $s \in$ IC-B' generated by $\theta_{B'}(z)$, $s \vdash D_2$ is also true. $\Box$

The transformation rules for the overlap case can be interpreted as follows: items i) and iv) represent straightforward transformation, since A' and B' are the integrated schema representations of A and B respectively. Items ii) and v) transform constraints involving A and B that are valid in $D_2$ and $D_1$ respectively. Since we have a CBequiv relationship, this is the entire set IC-A and IC-B. An implicit assumption made here is that, if all constraints involving A and B are valid, then the attributes involved in the constraints would have been determined to be part of the overlap between A and B and would exist in AB'. Items iii) and vi) represent the inheritance of constraints by the subclasses of AB'.

Rule 5: If A CBsubsume B, then the individual integrity constraints involving A and B are transformed as follows:

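i) $\forall x$, $x \in$ IC-A, $\theta_{A'}(x)$,

ii) $\forall y$, $y \in$ IC-A, $\theta_{AB'}(y)$,

iii) $\forall z$, $z \in$ IC-A, $\theta_{B'}(z)$,

iv) $\forall u$, $u \in$ IC-B, $\theta_{B'}(u)$,

v) $\forall v$, $v \in$ IC'-B, $\theta_{AB'}(v)$,

vi) $\forall w$, $w \in$ IC'-B, $\theta_{A'}(w)$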

Proof: In this case, three things need to be shown: i) $\forall q$, $q \in$ IC-AB', $q \vdash D_1$ and $q \vdash D_2$, ii) $\forall r$, $r \in$ IC-A', $r \vdash D_1$ and iii) $\forall s$, $s \in$ IC-B', $s \vdash D_2$.

The proof for Rule 5 is similar to that for Rule 4.

i) We know that all constraints $q$, $q \in$ IC-AB' were generated by $\theta_{AB'}(y)$ or $\theta_{AB'}(v)$.

a) $y \in$ IC-A $\rightarrow y \vdash D_1$ (by definition) is true. Hence, $\forall q$, $q \in$ IC-AB' generated by $\theta_{AB'}(y)$, $q \vdash D_1$ is true. Also, A CBsubsume B $\rightarrow y \vdash D_2$, $\forall y$, $y \in$ IC-A. Hence, $\forall q$, $q \in$ IC-AB' generated by $\theta_{AB'}(y)$, $q \vdash D_2$ is also true.

b) $v \in$ IC-B $\rightarrow v \vdash D_2$ (by definition) is true. Hence, $\forall q$, $q \in$ IC-AB' generated by $\theta_{AB'}(v)$, $q \vdash D_2$ is true. Also, A CBsubsume B $\rightarrow v \vdash D_1$, $\forall v$, $v \in$ IC'-B. Hence, $\forall q$, $q \in$ IC-AB' generated by $\theta_{AB'}(v)$, $q \vdash D_1$ is also true.

ii) We know that all constraints $r$, $r \in$ IC-A' were generated by $\theta_{A'}(x)$ or $\theta_{A'}(w)$.

a) $x \in$ IC-A $\rightarrow x \vdash D_1$ (by definition) is true. Hence, $\forall r$, $r \in$ IC-A' generated by $\theta_{A'}(x)$, $r \vdash D_1$ is true.

b) A CBsubsume B $\rightarrow w \vdash D_1$ is true $\forall w$, $w \in$ IC'-B. Hence, $\forall r$, $r \in$ IC-A' generated by $\theta_{A'}(w)$, $r \vdash D_1$ is also true.

iii) We know that all constraints $s$, $s \in$ IC-B' were generated by $\theta_{B'}(u)$ or $\theta_{B'}(z)$.

a) $u \in$ IC-B $\rightarrow u \vdash D_2$ (by definition) is true. Hence, $\forall s$, $s \in$ IC-B' generated by $\theta_{B'}(u)$, $s \vdash D_2$ is true.

b) A CBsubsume B $\rightarrow z \vdash D_2$ is true $\forall z$, $z \in$ IC-A. Hence, $\forall s$, $s \in$ IC-B' generated by $\theta_{B'}(z)$, $s \vdash D_2$ is also true. $\Box$

Informally, the difference between Rule 5 and Rule 4 is that instead of transforming all constraints involving B into constraints involving AB', we only transform those constraints that are valid in database $D_1$ (IC'-B). Thus, only these constraints get inherited and transformed into constraints involving A'.


Rule 6: If A CBoverlap B, then the individual integrity constraints involving A and B are transformed as follows:

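i) $\forall x$, $x \in$ IC-A, $\theta_{A'}(x)$,

ii) $\forall y$, $y \in$ IC'-A, $\theta_{AB'}(y)$,

iii) $\forall z$, $z \in$ IC'-A, $\theta_{B'}(z)$,

iv) $\forall u$, $u \in$ IC-B, $\theta_{B'}(u)$,

v) $\forall v$, $v \in$ IC'-B, $\theta_{AB'}(v)$,

vi) $\forall w$, $w \in$ IC'-B, $\theta_{A'}(w)$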

Rule 6 is also similar to Rules 4 and 5 with the difference being that only those constraints involving A and B that are valid in the other database are transformed into constraints involving AB’ and are inherited by A’ and B’ respectively. The proof is similar to the other two cases and is omitted for brevity.
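The three overlap rules differ only in which constraints are known to be valid in the other database, so they can be sketched with a single function. This is a hedged illustration, not the authors' implementation: constraints are opaque strings, the function name and arguments are hypothetical, and the sketch follows the informal reading above, in which only constraints valid in the other database are placed on AB' and inherited by the other subclass.

```python
def integrate_overlap(ic_a, ic_b, ic_a_valid_in_d2, ic_b_valid_in_d1):
    """Sketch of the overlap rules for the integrated classes AB', A' and B'.

    CBequiv:   pass ic_a and ic_b themselves as the two "valid" subsets.
    CBsubsume: pass ic_a as ic_a_valid_in_d2 and IC'-B as ic_b_valid_in_d1.
    CBoverlap: pass IC'-A and IC'-B.
    """
    return {
        # Constraints valid in both databases describe the common class AB'.
        "AB'": list(ic_a_valid_in_d2) + list(ic_b_valid_in_d1),
        # A' keeps its own constraints and inherits B's shared ones.
        "A'": list(ic_a) + list(ic_b_valid_in_d1),
        # B' keeps its own constraints and inherits A's shared ones.
        "B'": list(ic_b) + list(ic_a_valid_in_d2),
    }


# Hypothetical constraint labels for a CBoverlap case.
result = integrate_overlap(["a1", "a2"], ["b1", "b2"], ["a1"], ["b2"])
```

Here AB' receives a1 and b2, A' receives a1, a2 and b2, and B' receives b1, b2 and a1.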

Case 4: A RWdisjoint B

Schema integration results in the generation of two unrelated entity classes A’ and B’ in the integrated schema. Since constraint-based relationships are not generated between RWdisjoint entity classes a CBdisjoint relationship is assumed to exist between these entity classes. Hence, the transformation rules require the transformation of the constraints involving A and B into constraints involving A’ and B’ respectively. Formally,

$\forall x$, $x \in$ IC-A, $\theta_{A'}(x)$,

$\forall y$, $y \in$ IC-B, $\theta_{B'}(y)$

4. SEMANTIC QUERY PROCESSING

Semantic query processing (SQP) techniques [10, 4] utilize integrity constraint knowledge to transform queries into more efficient, semantically equivalent queries. It has been shown that the application of SQP techniques can result in significant savings compared to using syntactic query optimization techniques alone [10, 20]. In this section, we describe how the generation of an integrated set of integrity constraints can facilitate SQP in a heterogeneous database environment.

In a heterogeneous database environment, users formulate queries on the integrated schema. These queries are then translated into sub-queries (using the global to local mapping information) in the languages of the databases that need to be accessed [11]. By generating an integrated set of integrity constraints, we can perform semantic query processing at the integrated schema level by treating the integrated schema and the integrated constraints as being representative of a single database. Once we have performed SQP and transformed the queries appropriately, the semantically transformed queries can be translated into sub-queries on the underlying databases. An alternative approach to performing SQP in a heterogeneous database environment, one that would not require the use of our integrity constraint integration methodology, would be to first transform the queries into local queries and then perform SQP based on the individual sets of integrity constraints defined on the databases.

As noted earlier, the process of integrity constraint integration can sometimes result in the generation of integrity constraints that are applicable at the global schema level, but are not explicitly specified at the individual database level. The presence of these additional constraints provides the semantic query processor with more opportunities for transforming a query specified on the global schema before issuing it to the local schema. The ability to perform SQP against heterogeneous databases can result in substantial savings because of two primary reasons: 1) it can eliminate access to one of the underlying databases and 2) it can result in an improved query being generated to all of the underlying databases compared to improved queries being generated to a single database.

Below, we present two examples that illustrate the savings that can be achieved by performing SQP using an integrated set of integrity constraints (which requires the use of our methodology) when compared to using the individual (local) integrity constraints. In the discussion below, we ignore any cost overhead introduced by the databases being (possibly) distributed. All estimations are based on database access costs estimated using the techniques discussed in [10, 13]. Let us assume the following parameters for the underlying databases: Ships = 20,000 tuples; Boats = 25,000 tuples and Oil_Tankers = 10,000 tuples. No. of tuples per page = 20, Time per page fetch = 0.01 seconds.

Example 3

Schema A: Ships(x1, OWNER, SHIPTYPE, ORIGIN, Deadwt, x6, x7)

SHIPTYPE = 'supertanker' ← Ships(x1, OWNER, SHIPTYPE, ORIGIN, Deadwt, x6, x7), Deadwt > 150

Schema B: Boats(x1, COUNTRY, OWNER, TYPE, Weight)

← BOATS(x1, iceland, OWNER, TYPE, Weight)

If we assume that SHIPS and BOATS are both SCHequiv and CBequiv, the resultant integrated schema and integrity constraints will be as shown below.

Integrated Schema: Ship_Boats(x1, OWNER, SHIPTYPE, COUNTRY, Wt, x6, x7, x8)

SHIPTYPE = 'supertanker' ← Ship_Boats(x1, OWNER, SHIPTYPE, COUNTRY, Wt, x6, x7, x8), Deadwt > 150

← Ship_Boats(x1, OWNER, SHIPTYPE, iceland, Wt, x6, x7, x8) $\Box$

Assume that the user now issues the following query

Select Owner from Ship-Boats where Shiptype = ‘Bulk Cargo’ and Deadwt > 250

Without our methodology, the sequence of steps followed to transform the query would be as follows:

1) Transform the query into sub-queries. This would result in queries on Ships and Boats being generated.

2) Semantically process the sub-queries using the integrity constraints specified against each relation at each local database. The query on Boats would not be semantically transformed, since there is no applicable constraint. The query on Ships would be transformed, and the need to access the SHIPS relation would be eliminated.

Hence, the estimated cost of database access in this case would be

time to access Ships + time to access Boats = 0 + 12.5 = 12.5 units

However, if we use our methodology, the sequence of steps followed in query transformation would be as follows:

1) Semantically transform the query. Since there is a constraint on SHIP-BOATS that can transform this query, the transformation is applied, resulting in access to both underlying databases being eliminated. Hence, the estimated time for database access in this case is 0 units.
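The cost figures in this example can be reproduced with a short calculation. The sketch below assumes a full scan costs one page fetch per 20 tuples at 0.01 seconds per fetch, as stated above. The helper names and the simplified contradiction test (a single threshold constraint on Deadwt implying a ship type) are hypothetical and cover only this example.

```python
import math

TUPLES_PER_PAGE = 20
SECONDS_PER_PAGE = 0.01


def scan_cost(tuples):
    """Estimated time to scan a relation with the given number of tuples."""
    return math.ceil(tuples / TUPLES_PER_PAGE) * SECONDS_PER_PAGE


def is_contradictory(query_type, query_min_wt, implied_type, threshold):
    """True if every row matching the weight condition must have implied_type,
    while the query asks for a different type (so no row can qualify)."""
    return query_min_wt >= threshold and query_type != implied_type


# Query: Shiptype = 'Bulk Cargo' and Deadwt > 250.
# Constraint: Deadwt > 150 implies SHIPTYPE = 'supertanker'.
pruned = is_contradictory("Bulk Cargo", 250, "supertanker", 150)

# Local SQP: the constraint is only known for Ships, so Boats is still scanned.
local_cost = (0.0 if pruned else scan_cost(20_000)) + scan_cost(25_000)

# Integrated SQP: the constraint holds on Ship_Boats, so neither is accessed.
integrated_cost = 0.0 if pruned else scan_cost(20_000) + scan_cost(25_000)
```

Under these assumptions the local approach costs about 12.5 units (1,250 page fetches on Boats) and the integrated approach costs 0 units.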

Case 2: Index Introduction

Consider the example schemas shown in Example 2. Assume that the user issues the following query against the integrated schema:

Select Owners from Oil-tankers where DEADWT > 200

Let us assume that the attributes SHIPTYPE in the SHIPS relation and TYPE in the Oil-Tankers relation are indexed attributes.

Without the use of our methodology, we would transform the query into a sub-query on the Oil-Tankers relation. This sub-query would then be subjected to SQP based on the individual constraints specified against the Oil-Tankers relation on the local database. However, this sub-query will undergo no semantic transformation, since no applicable constraints exist on the relation.

Using our methodology, due to the presence of the additional constraint (highlighted in Example 2) on the entity Int_Oil_Tankers (generated by the integrity constraint integration process), the user's query would be transformed into a query using the TYPE attribute in the Int_Oil_Tankers entity. This transformed query would then be issued to the database as a sub-query which uses the indexed attribute TYPE in the Oil-Tankers relation, which in turn could lead to substantial savings in query processing time.
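Index introduction can be sketched as adding an implied condition on an indexed attribute to the query. The representation below (a dictionary of conditions and a single threshold rule) and the function name are hypothetical simplifications, used only to illustrate the transformation described above.

```python
def introduce_index_predicate(conditions, rule, indexed):
    """Add an implied equality on an indexed attribute, if the rule applies.

    conditions: dict mapping attribute -> (operator, value)
    rule: (body_attr, threshold, head_attr, head_value), read as
          "body_attr > threshold implies head_attr = head_value"
    indexed: set of indexed attribute names
    """
    body_attr, threshold, head_attr, head_value = rule
    cond = conditions.get(body_attr)
    applies = (
        cond is not None
        and cond[0] == ">"
        and cond[1] >= threshold          # query range lies inside the rule body
        and head_attr in indexed
        and head_attr not in conditions
    )
    if applies:
        return {**conditions, head_attr: ("=", head_value)}
    return conditions


# Query: DEADWT > 200; inherited constraint: DEADWT > 200 implies TYPE = 'supertanker'.
query = {"DEADWT": (">", 200)}
rule = ("DEADWT", 200, "TYPE", "supertanker")
rewritten = introduce_index_predicate(query, rule, indexed={"TYPE"})
```

The rewritten query keeps the original condition and gains `TYPE = 'supertanker'`, which lets the local database use the index on TYPE.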


4.1 Simulation Study


It is clear from the above examples that, under the appropriate circumstances, the use of our methodology can result in considerable savings in query processing time. To illustrate the benefits of using our methodology, we simulated the execution of 1000 queries on a sample database. Two simulation models were constructed, one representing a database that does not utilize the IC integration methodology proposed in this paper (the control group) and one that uses the methodology proposed in this paper (the experimental group). The models were constructed using the SIMAN language. The simulations themselves were run on a VAX 4000/300 machine running the VMS operating system.

[Figure: flowchart. Associate Entity Category, then "Transform Query?"; Yes: Calculate Delay for Semantically Transformed Query; No: Calculate Delay for Non-transformed Query]

Fig. 2: Logical Simulation Model


Figure 2 shows a logical representation of how the simulation models calculate the delay associated with a query. The figure shows that a query’s delay is determined based on four factors:

1) The first factor that affects the delay associated with a query is the type of entity class in the global schema on which the query is formulated. Since the entity classes in the global schema are generated as a result of the integrated schema generation process, we can divide the set of entity classes into six distinct categories [12]: equivalence class, superclass of a subsumption relationship, subclass of a subsumption relationship, union class generated as a result of an overlap relationship, original classes of an overlap relationship and disjoint relationship. For example, an entity belonging to the equivalence class in the integrated schema would be generated as a result of an equivalence relationship being specified between two entity classes in the local schemas. Once the entity class category associated with the query has been determined, the actual entity class belonging to the appropriate category is selected at random (from our example databases).

2) The second characteristic that is associated with a query is whether the query is going to be semantically transformed. Since one of the purposes of our study was to evaluate the effect of semantic query transformation, we varied the percentage of queries that would be semantically transformed in our experiments.

3) The third characteristic that is associated with a query is the type of semantic transformation that it is going to undergo. As noted by King [10], there are several different types of semantic transformations. In our study, we simulated the different types of semantic transformations by associating different delays for the various transformations.

4) Once the characteristics described above have been associated with a query, the delay associated with the query is determined by the simulation model. This delay represents an estimate of the time it would take to process the (transformed or non-transformed) queries on the underlying databases. As noted in step 1, a query can be formulated on six different categories of entity classes. The discussion in Section 3.3 shows that the rules for generating integrated integrity constraints are dependent on the category of relationship between the underlying entity classes. Hence, the delay associated with a particular query is calculated as a function of the entity class category on which the query is formulated.
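The four steps above can be sketched as a simple loop. The original models were written in SIMAN; the Python below is only an illustrative approximation. The function name, the per-category base delays and the choice of a uniform fraction between 0 and 0.5 for transformed queries are assumptions made for the sketch, and optimization overheads are ignored.

```python
import random


def simulate(n_queries, category_probs, p_transform, base_delay_ms, rng):
    """Return the total delay (ms) for n_queries under simplified assumptions."""
    categories = list(category_probs)
    weights = [category_probs[c] for c in categories]
    total = 0.0
    for _ in range(n_queries):
        # Step 1: pick the entity class category of the query.
        category = rng.choices(categories, weights=weights)[0]
        # Step 2: decide whether the query is semantically transformed.
        transformed = rng.random() < p_transform
        # Step 3: the type of transformation is modelled as a delay fraction.
        fraction = rng.uniform(0.0, 0.5) if transformed else 1.0
        # Step 4: compute the delay for this query.
        total += base_delay_ms[category] * fraction
    return total


rng = random.Random(0)
no_transform = simulate(1000, {"EQUIV": 1.0}, 0.0, {"EQUIV": 5000.0}, rng)
all_transform = simulate(1000, {"EQUIV": 1.0}, 1.0, {"EQUIV": 5000.0}, rng)
```

With no queries transformed, the total is simply 1000 full scans; with every query transformed, the total is at most half of that under the assumed delay fractions.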

The formulae for calculating the delays for the various cases in our simulation study are summarized in Table 2.

| Entity Class Category | Experimental Group | Control Group |
|---|---|---|
| Equivalence Class | Dt = Gsqo + DSQ1 + DSQ2 | Dt = Lsqo + DSQ1 + DQ2 |
| Superclass of Subsumption | Dt = Gsqo + DSQ1 | Dt = Lsqo + DSQ1 |
| Subclass of Subsumption | Case 1: Dt = Gsqo + DSQ1; Case 2: Dt = Gsqo + DSQ1 | Case 1: Dt = Lsqo + DSQ1; Case 2: Dt = Lsqo + DQ1 |
| Union Class of Overlap | Case 1: Dt = Gsqo + DSQ1 + DSQ2; Case 2: Dt = SQ1 + DQ2 | Case 1: Dt = Lsqo + DSQ1 + DQ2; Case 2: Dt = Lsqo + DSQ1 + DQ2 |
| Individual Class of Overlap | Case 1: Dt = Gsqo + DSQ1; Case 2: Dt = Gsqo + DSQ1 | Case 1: Dt = Lsqo + DSQ1; Case 2: Dt = Lsqo + DQ1 |
| Disjoint Classes | Dt = Gsqo + DSQ1 | Dt = Lsqo + DSQ1 |

Table 2: Delay Calculations for Various Entity Class Categories


Legend:

Gsqo - Time taken to generate optimized queries at the global level. This represents a measure of the amount of time it takes to semantically process a query and generate one or more alternate queries.

Lsqo - Time taken to generate optimized queries at the local database level. This represents the amount of time it takes to semantically process a query at the individual database level.

Dt - Total anticipated delay for the query. This estimate includes the amount of time needed to optimize the queries as well as execute the queries on the appropriate database.

DSQi - Delay for executing an optimized query i.

DQi - Delay for executing a non-optimized query i.

4.2 Parameters and Procedures

Based on the description of the simulation model presented in the previous section, it is clear that two key factors play a role in determining the delay for a query: 1) $P_e$, the entity class category (in the integrated schema) on which the query is issued, and 2) $P_t$, the probability that a query will be transformed. Hence, these were the two main factors varied in our simulation study.

To study the effect of the entity class category on a query's delay, we created seven different scenarios. In six of the seven scenarios (B-G), a majority of the queries were associated with a particular category of entity class (in the integrated schema). In scenario A, the probabilities were set up to ensure an even distribution of queries across entity class categories. The different scenarios and the probability that a query will be associated with a particular entity class category in each scenario are shown in Table 3. We were also interested in evaluating the effect of the probability of semantic query transformation on a query's delay. Hence, we also used eight different transformation probability levels in our study. The various probabilities of transformation used in the study are listed in Table 4.

Table 3: Probability of Semantic Query Transformation Used in Each Scenario

| Condition | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| % of Queries Transformed Semantically | 0.00 | 5.00 | 15.00 | 25.00 | 50.00 | 62.50 | 75.00 | 87.50 |

Table 4: Various Transformation Probabilities Used in the Study

Thus, for each of the seven scenarios, we ran 8 simulations (one for each level of transformation probability). Hence, data was collected for a total of 56 different simulation runs. Each run simulated the execution of 1000 queries through the control group or experimental group model and was repeated 20 times to ensure sufficiently tight confidence intervals. We used the two sample databases shown in Table 5 in our study. The size of each relation (in tuples) in the databases is shown in parentheses. The first database was identical to the database used by King [10] to study semantic query processing and consisted of relations ranging in size from 1,000 to 25,000 tuples. The second database consisted of relations that were related by an equivalence, subsumption, overlap or disjoint relationship to relations in the first database. The size of the relations in this database ranged from 700 to 25,000 tuples. The integrated schema and the categories to which the entity classes in the schema belong are shown in Table 6.

Since we used only SELECT queries in our study, the delay for non-optimized queries was set to the amount of time it would take to scan through a table (assuming no indexed attribute exists). All delays were calculated based on a page access time of 5 ms and a page size of 4K, assuming that 20 records can fit into one page. For example, for the SHIPS relation with 20,000 tuples, the non-optimized delay was calculated as:

Number of pages = 20000/20 = 1000; Delay = 1000 * 5 = 5000 ms

Delays for optimized queries were set at 0-50% of the non-optimized query delay. These differing degrees of delay in the optimized queries were used to simulate the different types of semantic query transformations possible. Shekhar et al. [19] used a Gsqo value of 5 ms for a database of 100 constraints. In our study, a proportional value for Gsqo based on the number of constraints was calculated and used. The delay for Lsqo was set at half of the Gsqo value, based on the assumption that the underlying databases contributed equally towards the global constraints.
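The delay parameters described above can be written out as a short calculation. The sketch assumes that "proportional" means linear scaling from 5 ms per 100 constraints; the function names and the sample delay values passed to `equivalence_delays` are hypothetical. The last function applies the equivalence-class formulas from Table 2.

```python
RECORDS_PER_PAGE = 20
PAGE_ACCESS_MS = 5


def non_optimized_delay_ms(tuples):
    """Full-scan delay: one page access per 20 records."""
    pages = -(-tuples // RECORDS_PER_PAGE)  # ceiling division
    return pages * PAGE_ACCESS_MS


def gsqo_ms(num_constraints):
    """Global optimization time, scaled linearly from 5 ms per 100 constraints
    (the linear form is an assumption)."""
    return 5.0 * num_constraints / 100


def lsqo_ms(num_constraints):
    """Local optimization time, set to half of the global value."""
    return gsqo_ms(num_constraints) / 2


def equivalence_delays(num_constraints, dsq1, dsq2, dq2):
    """Equivalence-class delays: (experimental, control), per Table 2."""
    experimental = gsqo_ms(num_constraints) + dsq1 + dsq2
    control = lsqo_ms(num_constraints) + dsq1 + dq2
    return experimental, control


ships_delay = non_optimized_delay_ms(20_000)
experimental, control = equivalence_delays(100, dsq1=1000, dsq2=1250, dq2=6250)
```

For the SHIPS relation this reproduces the 5000 ms figure given above.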

Database 1 Database 2

SHIPS (20,000) BOATS (25,000)

PORTS (1,000)

CARGOES (25,000)

OILTANKERS (10,000)

NON-OILTANKERS (15,000)

OWNERS (1,000)

POLICIES (25,000)

INSURERS (500)

MERCHANDISE (15,000)

INSCOMP (700)

SHORT-TERM-POL (10,000)

LONG-TERM-POL (13,000)

Table 5: Individual Database Schemas

| Entity Class Name | Entity Class Category |
|---|---|
| SHIP-BOATS | EQUIV, SUP |
| PORTS | DISJ |
| INSURERS | EQUIV |
| OILTANKERS | SUB |
| NON-OILTANKERS | SUB |
| CARGOES | INDOVERLAP |
| MERCHANDISE | INDOVERLAP |
| SHIPS-LOAD | UNIONOVERLAP |
| POLICIES | SUP |
| SHORT-TERM-POL | SUB |
| LONG-TERM-POL | SUB |
| OWNERS | DISJ |

Table 6: Integrated Schema


4.3 Results

Table 7 shows the delay for 1000 queries (averaged across 20 simulation runs) in the control and experimental group in Scenario A. Recall from Table 3 that scenario A distributed the 1000 queries evenly across entity classes (in the integrated schema) belonging to the six entity class categories. Each column in Table 7 represents the delay for 1000 queries when a certain percentage of queries are subject to semantic query transformation. The last row indicates the level of statistical significance at which there is a difference between the average delays in the control and experimental groups. An examination of the first two columns of Table 7 indicates that there is no significant difference in the delays, at the $\alpha$ = .05 level, when only 0% or 5% of the queries are subject to semantic query transformation. However, an examination of the p-values in the rest of the columns indicates that there is a significant difference between the means at the same level of significance (since the p-values are < .05). This indicates that when at least 15% of the queries are transformed semantically, SQP at the integrated schema level results in significant savings in query processing time. Figure 3 shows a diagrammatic representation of the delays in the control and experimental conditions and the savings in time that can be achieved by the use of our methodology.

Table 7: Scenario A. Even Distribution of Queries Across All Entity Class Categories. Time for 1000 queries.

[Figure: Time versus Transformation Probability for the control and experimental groups]

Fig. 3: Condition A: Even Distribution of Queries Across All Entity Class

Table 8: Scenario B. Heavy Concentration of Queries on Superclass Entity Classes. Time for 1000 queries.

[Figure: Time versus Transformation Probability]

Fig. 4: Condition B: Heavy Concentration of Queries on Superclass Entity Classes


Table 9: Condition F. Heavy Concentration of Queries on Individual Overlap Entity Classes. Time for 1000 queries (× 10^6)

[Plot: time for 1000 queries versus transformation probability]

Fig. 5: Condition F: Heavy Concentration of Queries on Individual Overlap Entity Classes


Table 10: Condition G. Heavy Concentration of Queries on Equivalent Entity Classes. Time for 1000 queries (× 10^6)

[Plot: time for 1000 queries versus transformation probability]

Fig. 6: Condition G: Heavy Concentration of Queries on Equivalent Entity Classes


Tables 8 through 10 show the mean delays for experimental and control groups for scenarios B, F and G respectively. Each of these tables should be interpreted in a fashion similar to Table 7. An examination of these tables shows that in each of the scenarios (which represent the different ways in which a query issued against the global schema can be semantically transformed into more efficient database queries) there is a statistically significant difference in query processing time between the control and experimental group when at least 15% of the queries are semantically transformed. These savings (for scenarios B, F and G) are represented diagrammatically in Figures 4 through 6 (Note: We have omitted the diagrams for the other conditions to conserve space. The results for the other conditions show similar characteristics). It is clear that the curves in Figures 4 through 6 are similar, the only difference being in the starting and ending values of the average delay for 1000 queries. This demonstrates that the savings from the use of our methodology are not dependent on the category of entity class on which the majority of queries are issued. These results suggest that the use of our methodology for integrity constraint integration can result in savings in query processing time as long as sufficient integrity constraints at the integrated schema level can be generated. However, it should be noted that the generation of these constraints is in turn dependent on the number of explicitly specified constraints at the local database level and the existence of related objects in the databases being integrated. This is important because if none of the queries being formulated can be transformed by the integrated integrity constraints then the use of our methodology would not result in any savings.
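To make the dependence on available constraints concrete, the sketch below shows one classic form of semantic query transformation: a range query that contradicts an integrated integrity constraint is answered as empty without contacting any local database, and a query that only partly overlaps the constraint is narrowed. This is an illustration only and is not necessarily the transformation modelled in the simulation; the class names, the Employee salary constraint, and the numeric bounds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeConstraint:
    """Integrity constraint: every value of `attribute` lies in [low, high]."""
    entity: str
    attribute: str
    low: float
    high: float


@dataclass(frozen=True)
class RangeQuery:
    """Query asking for entities whose `attribute` lies in [low, high]."""
    entity: str
    attribute: str
    low: float
    high: float


def transform(query, constraints):
    """Return ("empty", None) if a constraint rules out every answer,
    otherwise ("execute", query narrowed by the applicable constraints)."""
    low, high = query.low, query.high
    for c in constraints:
        if (c.entity, c.attribute) != (query.entity, query.attribute):
            continue  # constraint does not apply to this query
        # Intersect the query range with the range the constraint allows.
        low, high = max(low, c.low), min(high, c.high)
        if low > high:
            return "empty", None
    return "execute", RangeQuery(query.entity, query.attribute, low, high)


# Hypothetical integrated constraint: salaries lie between 30000 and 90000.
constraints = [RangeConstraint("Employee", "salary", 30000, 90000)]

contradicted = transform(RangeQuery("Employee", "salary", 0, 20000), constraints)
narrowed = transform(RangeQuery("Employee", "salary", 50000, 120000), constraints)
untouched = transform(RangeQuery("Project", "budget", 0, 1000), constraints)
```

The third call has no applicable constraint and is passed through unchanged, which corresponds to the case noted above in which no savings arise.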

5. CONCLUSIONS

In this paper, we presented a seven-step methodology that enhances the traditional schema integration methodology by utilizing integrity constraint knowledge to augment the schema integration process and derives a set of integrity constraints applicable to the integrated schema. We introduced the concept of constraint-based relationships and presented techniques for generating an integrated set of integrity constraints based on these relationships. Finally, we described how the integrated set of constraints can be used in semantic query processing and how this can result in significant savings in query processing time in a heterogeneous database environment.

This paper makes a substantial contribution to the database literature by demonstrating the role of integrity constraints in facilitating heterogeneous database interoperability. However, several extensions to the methodology presented in the paper are possible. We identify several crucial areas of research below.

One of the assumptions made in our methodology for integrity constraint integration is that the underlying databases have a well-defined set of integrity constraints explicitly specified on them. However, such constraints are typically embedded within the application programs that use the database. Hence, a key area of future research is the discovery of integrity constraints in legacy systems. Shekhar et al. [19] present a technique for discovering integrity constraints in a database using data mining techniques. We plan to use this work as a starting point for exploring this avenue of research.

Another assumption made in our methodology for integrity constraint integration was that the set of integrity constraints specified on a database represents a complete set of constraints. However, such a complete specification of constraints may not always be available. Hence, we need to develop techniques that relax this restriction and extend the applicability of our approach to databases with an incomplete specification of integrity constraints.

Finally, we have also made the assumption that the set of constraints specified on a database remains relatively static. However, frequent changes to underlying databases’ constraints would require regeneration of the integrated set of integrity constraints. Over time, the cost of generating integrated integrity constraints can begin to accrue and may nullify any query processing savings achieved due to the use of our methodology. We would like to develop a more comprehensive simulation model that takes the dynamic nature of the constraints into account. Such a model will enable us to evaluate the potential benefits of the use of our methodology in the face of changing constraints.
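The trade-off between regeneration cost and query processing savings can be written as simple arithmetic. The sketch below is a back-of-the-envelope model, not the simulation model proposed above: it assumes a fixed saving per transformed query and a fixed cost per regeneration of the integrated constraints, and every parameter name and value is hypothetical.

```python
def net_saving(queries, transform_fraction, saving_per_query,
               regenerations, cost_per_regeneration):
    """Net time saved over a period, in the same time unit as the inputs."""
    gross = queries * transform_fraction * saving_per_query
    return gross - regenerations * cost_per_regeneration


def break_even_regenerations(queries, transform_fraction, saving_per_query,
                             cost_per_regeneration):
    """Number of regenerations at which the gross saving is used up."""
    gross = queries * transform_fraction * saving_per_query
    return gross / cost_per_regeneration


# Hypothetical numbers: 1000 queries, 25% transformed, 400 time units saved
# per transformed query, 20,000 time units per constraint regeneration.
print(net_saving(1000, 0.25, 400, 3, 20_000))
print(break_even_regenerations(1000, 0.25, 400, 20_000))
```

Under these assumed values the saving is exhausted once the constraints have to be regenerated five times in the period, which is the kind of threshold a dynamic simulation model would need to estimate.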

Acknowledgements - This is a revised and expanded version of a paper presented at the WITS '94 workshop.


REFERENCES

[1] H. An and L.J. Henschen. Knowledge Based Semantic Query Optimization. International Symposium on Methodologies for Intelligent Systems, pp. 82-91 (1992).

[2] C. Batini, M. Lenzerini and S.B. Navathe. A Comparative Analysis of Methodologies for Database Schema Integration. ACM Computing Surveys, 18(4):323-364 (1986).

[3] E. Bertino and D. Musto. Query Optimization by using Knowledge about Data Semantics. Data and Knowledge Engineering, 9:121-155 (1992).

[4] U.S. Chakravarthy, D. Fishman and J. Minker. Logic-Based Approach to Semantic Query Optimization. ACM Transactions on Database Systems, 15(2):162-207 (1990).

[5] P.P. Chen. The Entity-Relationship Model: Toward a Unified View of Data. ACM Transactions on Database Systems, 1(1):9-36 (1976).

[6] C.A. Draxler. A Powerful Prolog to SQL Compiler. CIS Centre for Information and Language Processing, Ludwig-Maximilians-Universität München (1993).

[7] W. Gotthard, P.C. Lockemann and A. Neufeld. System-Guided View Integration for Object-Oriented Databases. IEEE Transactions on Knowledge and Data Engineering, 4(1):1-22 (1992).

[8] M. Hammer and D. McLeod. Semantic integrity in a relational data base system. Proceedings of the First VLDB Conference (1975).

[9] S. Hayne and S. Ram. Multi-User View Integration (MUVIS): An Expert System for View Integration. Proceedings of the Sixth International Conference on Data Engineering, pp. 402-409 (1990).

[10] J.J. King. QUIST: A System for Semantic Query Optimization in Relational Databases. Proceedings of the Seventh VLDB Conference, pp. 510-517 (1981).

[11] T. Landers and R. Rosenberg. An overview of multibase. In H. Schneider, editor, Distributed Data Systems, pp. 153-184, North-Holland (1982).

[12] J.A. Larson, S.B. Navathe and R. Elmasri. A Theory of Attribute Equivalence in Databases with Application to Schema Integration. IEEE Transactions on Software Engineering, 15(4):449-463 (1989).

[13] L.F. Mackert and G.M. Lohman. R* Optimizer Validation and Performance Evaluation for Local Queries. Proceedings of ACM SIGMOD, pp. 84-95 (1986).

[14] M.J. Pan, S.K. Chang and C.C. Yang. A Semantic Query Processing in Multidatabase Systems: A Logic-Based Approach. Proceedings of Future Trends in Distributed Computing Systems, pp. 318-324 (1992).

[15] V. Ramesh and S. Ram. A Methodology for Interschema Relationship Identification in Heterogeneous Databases. Proceedings of the 28th Hawaii International Conference on System Sciences, pp. 263-272, Hawaii (1995).

[16] M.P. Reddy, M. Siegel and A. Gupta. Towards an Active Schema Integration Architecture for Heterogeneous Database Systems. Proceedings of RIDE-IMS '93 - Research Issues in Data Engineering: Interoperability of Multidatabase Systems, pp. 178-183 (1993).

[17] R. Reiter. Towards a Logical Reconstruction of Relational Database Theory. In On Conceptual Modeling, M.L. Brodie, J. Mylopoulos and J.W. Schmidt, editors, Springer Verlag, pp. 191-233 (1984).

[18] M. Siegel, E. Sciore and S. Salveter. A Method for Automatic Rule Derivation to Support Semantic Query Optimization. ACM Transactions on Database Systems, 17:563-600 (1992).

[19] S. Shekhar, B. Hamidzadeh, A. Kohli and M. Coyle. Learning Transformation Rules for Semantic Query Optimization: A Data-Driven Approach. IEEE Transactions on Knowledge and Data Engineering, 5(6):950-964 (1993).

[20] S.T. Shenoy and Z.M. Ozsoyoglu. Design and Implementation of a Semantic Query Optimizer. IEEE Transactions on Knowledge and Data Engineering, 1(3):344-361 (1989).

[21] A.P. Sheth and S. Gala. Attribute Relationships: An Impediment in Automating Schema Integration. Proceedings of the NSF Workshop on Heterogeneous Database Systems (1989).

[22] P. Shoval and S. Zohn. A Binary-Relationship Integration Methodology. Data and Knowledge Engineering, 6:225-250 (1991).

[23] S. Spaccapietra, C. Parent and Y. DuPont. Independent Assertions for Integration of Heterogeneous Schemas. Very Large Database Journal, 1(1) (1992).

[24] H.J.A. van Kuijk, F.H.E. Pijpers and P.M.G. Apers. Semantic Query Optimization in Distributed Databases. Proceedings of Advances in Computing and Information - ICCI '90, pp. 295-303 (1990).