Master of Computer Application (MCA) – Semester 4
MC0077 – Advanced Database Systems
Assignment Set – 1
1. Explain the following normal forms with a suitable example
demonstrating the reduction of a sample table into the said
normal forms:
A) First Normal Form
B) Second Normal Form
C) Third Normal Form
Ans:-
The normal forms defined in relational database theory represent
guidelines for record design. The guidelines corresponding to first through fifth
normal forms are presented here, in terms that do not require an understanding of
relational theory. The design guidelines are meaningful even if one is not using a
relational database system. We present the guidelines without referring to the
concepts of the relational model in order to emphasize their generality, and also to
make them easier to understand. Our presentation conveys an intuitive sense of the
intended constraints on record design, although in its informality it may be
imprecise in some technical details. A comprehensive treatment of the subject is
provided by Date [4].
The normalization rules are designed to prevent update anomalies and data
inconsistencies. With respect to performance tradeoffs, these guidelines are biased
toward the assumption that all non-key fields will be updated frequently. They tend
to penalize retrieval, since data which may have been retrievable from one record
in an unnormalized design may have to be retrieved from several records in the
normalized form. There is no obligation to fully normalize all records when actual
performance requirements are taken into account.
2 FIRST NORMAL FORM
First normal form [1] deals with the "shape" of a record type.
Under first normal form, all occurrences of a record type must contain the same
number of fields.
First normal form excludes variable repeating fields and groups. This is not so
much a design guideline as a matter of definition. Relational database theory
doesn't deal with records having a variable number of fields.
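As a minimal illustration (the employee/phone table and column names below are assumptions for the sketch, not from the source), consider reducing a record with a variable repeating field to first normal form:

-- Unnormalized: a variable number of phone fields per employee record
-- (violates first normal form).
--   EMPLOYEE(EMP_ID, NAME, PHONE_1, PHONE_2, PHONE_3, ...)

-- 1NF: every occurrence of a record type has the same number of fields;
-- the repeating phone numbers move into a separate record type.
CREATE TABLE employee (
    emp_id INT PRIMARY KEY,
    name   VARCHAR(50)
);

CREATE TABLE employee_phone (
    emp_id INT,
    phone  VARCHAR(20),
    PRIMARY KEY (emp_id, phone)
);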
3 SECOND AND THIRD NORMAL FORMS
Second and third normal forms [2, 3, 7] deal with the relationship between non-key
and key fields.
Under second and third normal forms, a non-key field must provide a fact about
the key, the whole key, and nothing but the key. In addition, the record must
satisfy first normal form.
We deal now only with "single-valued" facts. The fact could be a one-to-many
relationship, such as the department of an employee, or a one-to-one relationship,
such as the spouse of an employee. Thus the phrase "Y is a fact about X" signifies
a one-to-one or one-to-many relationship between Y and X. In the general case, Y
might consist of one or more fields, and so might X. In the following example,
QUANTITY is a fact about the combination of PART and WAREHOUSE.
3.1 Second Normal Form
Second normal form is violated when a non-key field is a fact about a subset of a
key. It is only relevant when the key is composite, i.e., consists of several fields.
Consider the following inventory record:
---------------------------------------------------
| PART | WAREHOUSE | QUANTITY | WAREHOUSE-ADDRESS |
====================-------------------------------
The key here consists of the PART and WAREHOUSE fields together, but
WAREHOUSE-ADDRESS is a fact about the WAREHOUSE alone. The basic
problems with this design are:
· The warehouse address is repeated in every record that refers to a part stored
in that warehouse.
· If the address of the warehouse changes, every record referring to a part
stored in that warehouse must be updated.
· Because of the redundancy, the data might become inconsistent, with
different records showing different addresses for the same warehouse.
· If at some point in time there are no parts stored in the warehouse, there may
be no record in which to keep the warehouse's address.
To satisfy second normal form, the record shown above should be decomposed
into (replaced by) the two records:
------------------------------- ---------------------------------
| PART | WAREHOUSE | QUANTITY | | WAREHOUSE | WAREHOUSE-ADDRESS |
====================----------- =============--------------------
When a data design is changed in this way, replacing unnormalized records with
normalized records, the process is referred to as normalization. The term
"normalization" is sometimes used relative to a particular normal form. Thus a set
of records may be normalized with respect to second normal form but not with
respect to third.
The normalized design enhances the integrity of the data, by minimizing
redundancy and inconsistency, but at some possible performance cost for certain
retrieval applications. Consider an application that wants the addresses of all
warehouses stocking a certain part. In the unnormalized form, the application
searches one record type. With the normalized design, the application has to search
two record types, and connect the appropriate pairs.
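A DDL sketch of the decomposition above (the SQL is an assumption; the source gives only record layouts, with the key fields underlined):

-- Second normal form: QUANTITY stays with the full key (PART, WAREHOUSE);
-- WAREHOUSE-ADDRESS, a fact about WAREHOUSE alone, gets its own table.
CREATE TABLE part_warehouse (
    part      VARCHAR(20),
    warehouse VARCHAR(20),
    quantity  INT,
    PRIMARY KEY (part, warehouse)
);

CREATE TABLE warehouse (
    warehouse         VARCHAR(20) PRIMARY KEY,
    warehouse_address VARCHAR(100)
);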
3.2 Third Normal Form
Third normal form is violated when a non-key field is a fact about another non-key
field, as in
------------------------------------
| EMPLOYEE | DEPARTMENT | LOCATION |
============------------------------
The EMPLOYEE field is the key. If each department is located in one place, then
the LOCATION field is a fact about the DEPARTMENT -- in addition to being a
fact about the EMPLOYEE. The problems with this design are the same as those
caused by violations of second normal form:
· The department's location is repeated in the record of every employee
assigned to that department.
· If the location of the department changes, every such record must be
updated.
· Because of the redundancy, the data might become inconsistent, with
different records showing different locations for the same department.
· If a department has no employees, there may be no record in which to keep
the department's location.
To satisfy third normal form, the record shown above should be decomposed into
the two records:
------------------------- -------------------------
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |
============------------- ==============-----------
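Again as a hedged DDL sketch of this decomposition (the column types are assumptions):

-- Third normal form: LOCATION, a fact about DEPARTMENT rather than about
-- the key EMPLOYEE, moves into a department table.
CREATE TABLE employee (
    employee   VARCHAR(30) PRIMARY KEY,
    department VARCHAR(20)
);

CREATE TABLE department (
    department VARCHAR(20) PRIMARY KEY,
    location   VARCHAR(50)
);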
2. Explain the concept of a Query. How does a Query Optimizer work?
Ans:
A database query can be either a select query or an action query. A select
query is simply a data retrieval query. An action query can ask for additional
operations on the data, such as insertion, updating, or deletion.
The aim of query processing is to find information in one or more databases
and deliver it to the user quickly and efficiently. Traditional techniques work well
for databases with standard, single-site relational structures, but databases
containing more complex and diverse types of data demand new query processing
and optimization techniques. Most real-world data is not well structured. Today’s
databases typically contain much non-structured data such as text, images, video,
and audio, often distributed across computer networks. In this complex milieu
(typified by the World Wide Web), efficient and accurate query processing
becomes quite challenging. Principles of Database Query Processing for Advanced
Applications teaches the basic concepts and techniques of query processing and
optimization for a variety of data forms and database systems, whether structured
or unstructured.
Query Optimizer
The Query Optimizer is the component of a database management system that
attempts to determine the most efficient way to execute a query. The optimizer
considers the possible query plans (discussed below) for a given input query, and
attempts to determine which of those plans will be the most efficient. Cost-based
query optimizers assign an estimated "cost" to each possible query plan, and
choose the plan with the least cost. Costs are used to estimate the runtime cost of
evaluating the query, in terms of the number of I/O operations required, the CPU
requirements, and other factors.
Query plan
A Query Plan (or Query Execution Plan) is a set of steps used to access
information in a SQL relational database management system. This is a specific
case of the relational model concept of access plans. Since SQL is declarative,
there are typically a large number of alternative ways to execute a given query,
with widely varying performance. When a query is submitted to the database, the
query optimizer evaluates some of the different, correct possible plans for
executing the query and returns what it considers the best alternative. Because
query optimizers are imperfect, database users and administrators sometimes need
to manually examine and tune the plans produced by the optimizer to get better
performance.
The set of query plans examined is formed by examining the possible access paths
(e.g. index scan, sequential scan) and join algorithms (e.g. sort-merge join, hash
join, nested loops). The search space can become quite large depending on the
complexity of the SQL query.
The query optimizer cannot be accessed directly by users. Instead, once queries are
submitted to the database server and parsed by the parser, they are passed to the
query optimizer, where optimization occurs.
How the Query Optimizer Works
The MySQL query optimizer has several goals, but its primary aims are to use
indexes whenever possible and to use the most restrictive index in order to
eliminate as many rows as possible as soon as possible. That last part might sound
backward and nonintuitive. After all, your goal in issuing a SELECT statement is
to find rows, not to reject them. The reason the optimizer tries to reject rows is that
the faster it can eliminate rows from consideration, the more quickly the rows that
do match your criteria can be found. Queries can be processed more quickly if the
most restrictive tests can be done first. Suppose that you have a query that tests two
columns, each of which has an index on it:
SELECT col3 FROM mytable
WHERE col1 = 'some value' AND col2 = 'some other value';
Suppose also that the test on col1 matches 900 rows, the test on col2 matches 300
rows, and that both tests together succeed on 30 rows. Testing col1 first results in
900 rows that must be examined to find the 30 that also match the col2 value.
That's 870 failed tests. Testing col2 first results in 300 rows that must be examined
to find the 30 that also match the col1 value. That's only 270 failed tests, so less
computation and disk I/O is required. As a result, the optimizer will test col2 first
because doing so results in less work overall.
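One way to see which index the optimizer actually chose is MySQL's EXPLAIN statement. A minimal sketch, reusing the hypothetical mytable from the example (the index names are assumptions):

CREATE INDEX idx_col1 ON mytable (col1);
CREATE INDEX idx_col2 ON mytable (col2);

EXPLAIN SELECT col3 FROM mytable
WHERE col1 = 'some value' AND col2 = 'some other value';
-- The key column of the EXPLAIN output names the chosen index; given the
-- row counts above, we would expect idx_col2, the more restrictive one.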
You can help the optimizer take advantage of indexes by using the following
guidelines:
Try to compare columns that have the same data type. When you use indexed
columns in comparisons, use columns that are of the same type. Identical data
types will give you better performance than dissimilar types. For example, INT is
different from BIGINT. CHAR(10) is considered the same
as CHAR(10) or VARCHAR(10) but different
from CHAR(12) or VARCHAR(12). If the columns you're comparing have
different types, you can use ALTER TABLE to modify one of them so that the
types match.
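For instance (t1 and t2 are hypothetical tables with a mismatched join column):

-- Suppose t1.customer_id is INT but t2.customer_id is BIGINT; aligning the
-- types helps the optimizer use an index when the columns are compared.
ALTER TABLE t2 MODIFY customer_id INT;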
Try to make indexed columns stand alone in comparison expressions. If you
use a column in a function call or as part of a more complex term in an arithmetic
expression, MySQL can't use the index because it must compute the value of the
expression for every row. Sometimes this is unavoidable, but many times you can
rewrite a query to get the indexed column to appear by itself.
The following WHERE clauses illustrate how this works. They are equivalent
arithmetically, but quite different for optimization purposes:
WHERE mycol < 4 / 2
WHERE mycol * 2 < 4
For the first line, the optimizer simplifies the expression 4/2 to the value 2, and
then uses an index on mycol to quickly find values less than 2. For the second
expression, MySQL must retrieve the value of mycol for each row, multiply by 2,
and then compare the result to 4. In this case, no index can be used. Each value in
the column must be retrieved so that the expression on the left side of the
comparison can be evaluated.
Let's consider another example. Suppose that you have an indexed
column date_col. If you issue a query such as the one following, the index isn't
used:
SELECT * FROM mytbl WHERE YEAR(date_col) < 1990;
The expression doesn't compare 1990 to an indexed column; it compares 1990 to a
value calculated from the column, and that value must be computed for each row.
As a result, the index on date_col is not used because performing the query
requires a full table scan. What's the fix? Just use a literal date, and then the index
on date_col can be used to find matching values in the columns:
WHERE date_col < '1990-01-01'
But suppose that you don't have a specific date. You might be interested instead in
finding records that have a date that lies within a certain number of days from
today. There are several ways to express a comparison of this type—not all of
which are equally efficient. Here are three possibilities:
WHERE TO_DAYS(date_col) - TO_DAYS(CURDATE()) < cutoff
WHERE TO_DAYS(date_col) < cutoff + TO_DAYS(CURDATE())
WHERE date_col < DATE_ADD(CURDATE(), INTERVAL cutoff DAY)
For the first line, no index is used because the column must be retrieved for each
row so that the value of TO_DAYS(date_col) can be computed. The second line is
better. Both cutoff and TO_DAYS(CURDATE()) are constants, so the right-hand
side of the comparison can be calculated by the optimizer once before processing
the query, rather than once per row. But the date_col column still appears in a
function call, preventing use of the index. The third line is best of all. Again, the
right-hand side of the comparison can be computed once as a constant before
executing the query, but now the value is a date. That value can be compared
directly to date_col values, which no longer need to be converted to days. In this
case, the index can be used.
Don't use wildcards at the beginning of a LIKE pattern. Some string searches
use a WHERE clause of the following form:
WHERE col_name LIKE '%string%'
That's the correct thing to do if you want to find a string no matter where it occurs
in the column. But don't put '%' on both sides of the string simply out of habit. If
you're really looking for the string only when it occurs at the beginning of the
column, leave out the first '%'. Suppose that you're looking in a column containing
last names for names like MacGregor or MacDougall that begin with 'Mac'. In that
case, write the WHERE clause like this:
WHERE last_name LIKE 'Mac%'
The optimizer looks at the literal initial part of the pattern and uses the index to
find rows that match as though you'd written the following expression, which is in
a form that allows an index on last_name to be used:
WHERE last_name >= 'Mac' AND last_name < 'Mad'
This optimization does not apply to pattern matches that use
the REGEXP operator. REGEXP expressions are never optimized.
3. Explain the following with respect to Heuristics of Query
Optimization:
A) Equivalence of Expressions
B) Selection Operation
C) Projection Operation
D) Natural Join Operation
Ans:
Equivalent expressions
We often want to replace a complicated expression with a simpler one that means
the same thing. For example, the expression x + 4 + 2 obviously means the same
thing as x + 6, since 4 + 2 = 6. More interestingly, the expression x + x + 4 means
the same thing as 2x + 4, because 2x is x + x when you think of multiplication as
repeated addition. (Which of these is simpler depends on your point of view, but
usually 2x + 4 is more convenient in Algebra.)
Two algebraic expressions are equivalent if they always lead to the same result
when you evaluate them, no matter what values you substitute for the variables.
For example, if you substitute x := 3 in x + x + 4, then you get 3 + 3 + 4, which
works out to 10; and if you substitute it in 2x + 4, then you get 2(3) + 4, which also
works out to 10. There's nothing special about 3 here; the same thing would happen
no matter what value we used, so x + x + 4 is equivalent to 2x + 4. (That's really
what I meant when I said that they mean the same thing.)
When I say that you get the same result, this includes the possibility that the result
is undefined. For example, 1/x + 1/x is equivalent to 2/x; even when you substitute
x := 0, they both come out the same (in this case, undefined). In contrast, x²/x is not
equivalent to x; they usually come out the same, but they are different when x := 0.
(Then x²/x is undefined, but x is 0.) To deal with this situation, there is a sort of
trick you can play, forcing the second expression to be undefined in certain cases.
Just add the words ‘for x ≠ 0’ at the end of the expression to make a new
expression; then the new expression is undefined unless x ≠ 0. (You can put any
other condition you like in place of x ≠ 0, whatever is appropriate in a given
situation.) So x²/x is equivalent to x for x ≠ 0.
To symbolise equivalent expressions, people often simply use an equals sign. For
example, they might say ‘x + x + 4 = 2x + 4’. The idea is that this is a statement
that is always true, no matter what x is. However, it isn't really correct to write
‘1/x + 1/x = 2/x’ to indicate an equivalence of expressions, because this statement
is not correct when x := 0. So instead, I will use the symbol ‘≡’, which you can
read ‘is equivalent to’ (instead of ‘is equal to’ for ‘=’). So I'll say, for example,
x + x + 4 ≡ 2x + 4,
1/x + 1/x ≡ 2/x, and
x²/x ≡ x for x ≠ 0.
The textbook, however, just uses ‘=’ for everything, so you can too, if you want.
Selection Operation
1. Consider the query to find the assets and branch-names of all branches that
have depositors living in Port Chester. In relational algebra, this is

   π_{BNAME, ASSETS} (σ_{CCITY = "Port Chester"} (CUSTOMER ⋈ DEPOSIT ⋈ BRANCH))

   o This expression constructs a huge relation,
     CUSTOMER ⋈ DEPOSIT ⋈ BRANCH,
     of which we are only interested in a few tuples.
   o We also are only interested in two attributes of this relation.
   o We can see that we only want tuples for which CCITY = "PORT
     CHESTER".
   o Thus we can rewrite our query as:

     π_{BNAME, ASSETS} ((σ_{CCITY = "Port Chester"} (CUSTOMER)) ⋈ DEPOSIT ⋈ BRANCH)

   o This should considerably reduce the size of the intermediate relation.
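The same heuristic can be seen in SQL. A sketch, assuming a schema in which CUSTOMER(CNAME, CCITY), DEPOSIT(BNAME, ACCOUNT#, CNAME, BALANCE), and BRANCH(BNAME, ASSETS) join on CNAME and BNAME:

-- As written, the filter logically applies to the full three-way join:
SELECT b.bname, b.assets
FROM customer c
JOIN deposit d ON d.cname = c.cname
JOIN branch  b ON b.bname = d.bname
WHERE c.ccity = 'Port Chester';

-- A query optimizer effectively pushes the selection down, as if the
-- query had been written with the filter applied before the joins:
SELECT b.bname, b.assets
FROM (SELECT cname FROM customer WHERE ccity = 'Port Chester') c
JOIN deposit d ON d.cname = c.cname
JOIN branch  b ON b.bname = d.bname;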
Projection Operation
1. Like selection, projection reduces the size of relations.
   It is advantageous to apply projections early. Consider this form of our
   example query:

   π_{BNAME, ASSETS} ((σ_{CCITY = "Port Chester"} (CUSTOMER)) ⋈ DEPOSIT ⋈ BRANCH)

2. When we compute the subexpression

   (σ_{CCITY = "Port Chester"} (CUSTOMER)) ⋈ DEPOSIT

   we obtain a relation whose scheme is
   (CNAME, CCITY, BNAME, ACCOUNT#, BALANCE)
3. We can eliminate several attributes from this scheme. The only ones we
   need to retain are those that
   o appear in the result of the query or
   o are needed to process subsequent operations.
4. By eliminating unneeded attributes, we reduce the number of columns of the
   intermediate result, and thus its size.
5. In our example, the only attribute we need is BNAME (to join with
   BRANCH). So we can rewrite our expression as:

   π_{BNAME, ASSETS} ((π_{BNAME} ((σ_{CCITY = "Port Chester"} (CUSTOMER)) ⋈ DEPOSIT)) ⋈ BRANCH)

6. Note that there is no advantage in doing an early projection on a relation
   before it is needed for some other operation:
   o We would access every block for the relation to remove attributes.
   o Then we access every block of the reduced-size relation when it is
     actually needed.
   o We do more work in total, rather than less!
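In SQL terms, early projection corresponds to keeping only the columns a later step needs. A sketch under the same assumed schema:

-- The derived table retains only bname, the one attribute needed for the
-- subsequent join with branch, shrinking the intermediate result.
SELECT b.bname, b.assets
FROM (SELECT d.bname
      FROM customer c
      JOIN deposit d ON d.cname = c.cname
      WHERE c.ccity = 'Port Chester') pc
JOIN branch b ON b.bname = pc.bname;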
Natural Join Operation
1. Another way to reduce the size of temporary results is to choose an optimal
   ordering of the join operations.
2. Natural join is associative:

   (r1 ⋈ r2) ⋈ r3 ≡ r1 ⋈ (r2 ⋈ r3)

3. Although these expressions are equivalent, the costs of computing them may
   differ.
   o Look again at our expression

     π_{BNAME, ASSETS} ((σ_{CCITY = "Port Chester"} (CUSTOMER)) ⋈ DEPOSIT ⋈ BRANCH)

   o We see that we can compute DEPOSIT ⋈ BRANCH first and then join
     the result with the first part.
   o However, DEPOSIT ⋈ BRANCH is likely to be a large relation as it
     contains one tuple for every account.
   o The other part,

     σ_{CCITY = "Port Chester"} (CUSTOMER),

     is probably a small relation (comparatively).
   o So, if we compute

     (σ_{CCITY = "Port Chester"} (CUSTOMER)) ⋈ DEPOSIT

     first, we get a reasonably small relation.
   o It has one tuple for each account held by a resident of Port Chester.
   o This temporary relation is much smaller than DEPOSIT ⋈ BRANCH.
4. Natural join is commutative:

   r1 ⋈ r2 ≡ r2 ⋈ r1

   o Thus we could rewrite our relational algebra expression as:

     π_{BNAME, ASSETS} (((σ_{CCITY = "Port Chester"} (CUSTOMER)) ⋈ BRANCH) ⋈ DEPOSIT)

   o But there are no common attributes between CUSTOMER and
     BRANCH, so this is a Cartesian product.
   o Lots of tuples!
   o If a user entered this expression, we would want to use the
     associativity and commutativity of natural join to transform it into
     the more efficient expression we derived earlier (join with
     DEPOSIT first, then with BRANCH).
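In MySQL the chosen join order can be overridden for experimentation with the STRAIGHT_JOIN modifier, which joins tables in the order listed. A sketch under the same assumed schema:

-- Forces CUSTOMER (filtered first) to join DEPOSIT before BRANCH,
-- mirroring the efficient ordering derived above; normally the
-- optimizer chooses the order itself.
SELECT STRAIGHT_JOIN b.bname, b.assets
FROM customer c
JOIN deposit d ON d.cname = c.cname
JOIN branch  b ON b.bname = d.bname
WHERE c.ccity = 'Port Chester';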
Que 4. There are a number of historical, organizational, and technological
reasons that explain the lack of an all-encompassing data management system.
Discuss a few of them with appropriate examples.
Ans:
A number of historical, organizational, and technological reasons explain the lack of an all-encompassing data management system. Among these are:
· The sensible advice – to build small systems with the plan to extend their scope in later implementation phases – allows a core system to be implemented relatively quickly, but has led to a proliferation of relatively small systems.
· Department autonomy has led to construction of department-specific rather than organization-wide systems, again leading to many small, overlapping, and often incompatible systems within an organization.
· The continual evolution of the organization and its interactions, both internal and with its external environment, prohibits complete understanding of future information requirements.
· Parallel development of data management systems for particular applications has led to different and incompatible systems for management of tabular/administrative data, text/document data, historical/statistical data, spatial/geographic data, and streamed/audio and visual data.
The result is that only a portion of an organization’s data is administered by any one data management system and most organizations have a multitude of special purpose databases, managed by different, and often incompatible, data management system types. The growing need to retrieve data from multiple databases within an organization, as well as the rapid dissemination of data through the Internet, has given rise to the requirement of providing integrated access to both internal and external data of multiple types.
A major challenge and critical practical and research problem for the information, computer, and communication technology communities is to develop data management systems that can provide efficient access to the data stored in multiple private and public databases. Problems to be resolved include:
1. Interoperability among systems
2. Incorporation of legacy systems and
3. Integration of management techniques for structured and unstructured data
Each of the above problems entails an integration of concepts, methods, techniques, and tools from separate research and development communities that have existed in parallel but independently and have had rather minimal interaction. One consequence is that overlapping and conflicting terminology exists between these communities. With this definition, no limitations are given as to the type of:
· Data in the collection,
· Model used to structure the collection, or
· Architecture and geographic location of the database
The focus of this text is on on-line – electronic and web-accessible – databases containing multiple media data, thus restricting our interest to multimedia databases stored on one or more computers (DB servers) and accessible from the Internet. Electronic databases are important since they contain data recording the products and services, as well as the economic history and current status, of the owner organization. They are also a source of information for the organization's employees and customers/users. However, databases cannot be used effectively
unless there exist efficient and secure data management systems (DMS) for the data in the databases.
Que.5. Describe the Structural Semantic Data Model (SSM) with relevant
examples.
ANS.
Modelling Complex and Multimedia Data
Data modelling addresses a need in information system analysis and design to
develop a model of the information requirements as well as a set of viable database
structure proposals. The data modelling process consists of:
1. Identifying and describing the information requirements for an information
system,
2. Specifying the data to be maintained by the data management system, and
3. Specifying the data structures to be used for data storage that best support
the information requirements.
A fundamental tool used in this process is the data model, which is used both for
specification of the information requirements at the user level and for specification
of the data structure for the database. During implementation of a database, the
data model guides construction of the schema or data catalog which contains the
metadata that describe the DB structure and data semantics that are used to support
database implementation and data retrieval.
Data modelling, using a specific data model type, and as a unique activity during
information system design, is commonly attributed to Charles Bachman (1969)
who presented the Data Structure Diagram as one of the first, widely used data
models for network database design. Several alternative data model types were
proposed shortly thereafter, the best known of which are the:
· Relational model (Codd, 1970) and the
· Entity-relationship (ER) model (Chen, 1976).
The relational model was quickly criticized for being 'flat' in the sense that all
information is represented as a set of tables with atomic cell values. The definition
of well-formed relational models requires that complex attribute types (hierarchic,
composite, multi-valued, and derived) be converted to atomic attributes and that
relations be normalized. Inter-entity (inter-relation) relationships are difficult to
visualize in the resulting set of relations, making control of the completeness and
correctness of the model difficult. The relational model maps easily to the physical
characteristics of electronic storage media, and as such, is a good tool for design of
the physical database.
The entity-relationship approach to modelling, proposed by Chen (1976), had two
primary objectives: first to visualize inter-entity relationships and second to
separate the DB design process into two phases:
1. Record, in an ER model, the entities and inter-entity relationships required
"by the enterprise", i.e. by the owner/user of the information system or
application. This phase and its resulting model should be independent of the
DBMS tool that is to be used for realizing the DB.
2. Translate the ER model to the data model supported by the DBMS to be
used for implementation.
This two-phase design supports modification at the physical level without
requiring changes to the enterprise or user view of the DB content.
Chen's ER model also quickly came under criticism, particularly for its lack of
ability to model classification structures. In 1977, Smith & Smith presented a
method for modelling generalization and aggregation hierarchies that underlie the
many extended/enhanced entity-relationship, EER, model types proposed and in
use today.
Que.6. What are the differences between Global and Local Transactions in a
distributed database system? What are the roles of the Transaction Manager
and Transaction Coordinator in managing transactions in a distributed
database?
Ans:
Differences between Global and Local Transaction
A local transaction is one that accesses data only at the site where the transaction
originated. A global transaction, on the other hand, is one that either accesses data
at a site different from the one at which the transaction was initiated, or accesses
data at several different sites.
Roles of Transaction Manager and Transaction Coordinator in managing transactions
in a distributed database
1. The Transaction Manager manages the execution of those transactions (or
sub-transactions) that access data stored in a local site. Note that each such
transaction may be either a local transaction (that is, a transaction that executes
only at that site) or part of a global transaction (that is, a transaction that
executes at several sites).
The Transaction manager is responsible for,
1. Maintaining a log for recovery purposes
2. Participating in an appropriate concurrency-control scheme to coordinate the
concurrent execution of the transactions executing at that site.
2. The Transaction Coordinator coordinates the execution of the various
transactions (both local and global) initiated at that site.
The coordinator is responsible for,
1. Starting the execution of the transaction
2. Breaking the transaction into a number of sub transactions and distributing
these sub transactions to the appropriate sites for execution.
3. Coordinating the termination of the transaction, which may result in the
transaction being committed at all sites or aborted at all sites.
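As one concrete illustration, MySQL exposes coordinator-driven distributed transactions through its XA statements. A sketch of the sequence a coordinator would drive at each participating site (the transaction id, table, and column names are assumptions); the coordinator issues XA COMMIT everywhere only after every site has prepared successfully, otherwise XA ROLLBACK everywhere:

XA START 'txn42';
UPDATE account SET balance = balance - 100 WHERE acct_no = 'A-101';
XA END 'txn42';
XA PREPARE 'txn42';
-- After collecting PREPARE outcomes from all sites, the coordinator ends
-- the transaction the same way at every site:
XA COMMIT 'txn42';
-- ... or, had any site failed to prepare: XA ROLLBACK 'txn42';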