Master of Computer Application (MCA) – Semester 4
MC0077 – Advanced Database Systems
Assignment Set – 1
1. Explain the following normal forms with a suitable example
demonstrating the reduction of a sample table into the said
normal forms:
A) First Normal Form
B) Second Normal Form
C) Third Normal Form
Ans:-
The normal forms defined in relational database theory represent
guidelines for record design. The guidelines corresponding to first through fifth
normal forms are presented here, in terms that do not require an understanding of
relational theory. The design guidelines are meaningful even if one is not using a
relational database system. We present the guidelines without referring to the
concepts of the relational model in order to emphasize their generality, and also to
make them easier to understand. Our presentation conveys an intuitive sense of the
intended constraints on record design, although in its informality it may be
imprecise in some technical details. A comprehensive treatment of the subject is
provided by Date [4].
The normalization rules are designed to prevent update anomalies and data
inconsistencies. With respect to performance tradeoffs, these guidelines are biased
toward the assumption that all non-key fields will be updated frequently. They tend
to penalize retrieval, since data which may have been retrievable from one record
in an unnormalized design may have to be retrieved from several records in the
normalized form. There is no obligation to fully normalize all records when actual
performance requirements are taken into account.
2 FIRST NORMAL FORM
First normal form [1] deals with the "shape" of a record type.
Under first normal form, all occurrences of a record type must contain the same
number of fields.
First normal form excludes variable repeating fields and groups. This is not so
much a design guideline as a matter of definition. Relational database theory
doesn't deal with records having a variable number of fields.
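As a minimal illustration (the employee/phone table and column names below are assumptions for the sketch, not from the source), consider reducing a record with a variable repeating field to first normal form:

-- Unnormalized: a variable number of phone fields per employee record
-- (violates first normal form).
--   EMPLOYEE(EMP_ID, NAME, PHONE_1, PHONE_2, PHONE_3, ...)

-- 1NF: every occurrence of a record type has the same number of fields;
-- the repeating phone numbers move into a separate record type.
CREATE TABLE employee (
    emp_id INT PRIMARY KEY,
    name   VARCHAR(50)
);

CREATE TABLE employee_phone (
    emp_id INT,
    phone  VARCHAR(20),
    PRIMARY KEY (emp_id, phone)
);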
3 SECOND AND THIRD NORMAL FORMS
Second and third normal forms [2, 3, 7] deal with the relationship between non-key
and key fields.
Under second and third normal forms, a non-key field must provide a fact about
the key, the whole key, and nothing but the key. In addition, the record must
satisfy first normal form.
We deal now only with "single-valued" facts. The fact could be a one-to-many
relationship, such as the department of an employee, or a one-to-one relationship,
such as the spouse of an employee. Thus the phrase "Y is a fact about X" signifies
a one-to-one or one-to-many relationship between Y and X. In the general case, Y
might consist of one or more fields, and so might X. In the following example,
QUANTITY is a fact about the combination of PART and WAREHOUSE.
3.1 Second Normal Form
Second normal form is violated when a non-key field is a fact about a subset of a
key. It is only relevant when the key is composite, i.e., consists of several fields.
Consider the following inventory record:
---------------------------------------------------
| PART | WAREHOUSE | QUANTITY | WAREHOUSE-ADDRESS |
====================-------------------------------
The key here consists of the PART and WAREHOUSE fields together, but
WAREHOUSE-ADDRESS is a fact about the WAREHOUSE alone. The basic
problems with this design are:
· The warehouse address is repeated in every record that refers to a part stored
in that warehouse.
· If the address of the warehouse changes, every record referring to a part
stored in that warehouse must be updated.
· Because of the redundancy, the data might become inconsistent, with
different records showing different addresses for the same warehouse.
· If at some point in time there are no parts stored in the warehouse, there may
be no record in which to keep the warehouse's address.
To satisfy second normal form, the record shown above should be decomposed
into (replaced by) the two records:
------------------------------- ---------------------------------
| PART | WAREHOUSE | QUANTITY | | WAREHOUSE | WAREHOUSE-ADDRESS |
====================----------- =============--------------------
When a data design is changed in this way, replacing unnormalized records with
normalized records, the process is referred to as normalization. The term
"normalization" is sometimes used relative to a particular normal form. Thus a set
of records may be normalized with respect to second normal form but not with
respect to third.
The normalized design enhances the integrity of the data, by minimizing
redundancy and inconsistency, but at some possible performance cost for certain
retrieval applications. Consider an application that wants the addresses of all
warehouses stocking a certain part. In the unnormalized form, the application
searches one record type. With the normalized design, the application has to search
two record types, and connect the appropriate pairs.
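A DDL sketch of the decomposition above (the SQL is an assumption; the source gives only record layouts, with the key fields underlined):

-- Second normal form: QUANTITY stays with the full key (PART, WAREHOUSE);
-- WAREHOUSE-ADDRESS, a fact about WAREHOUSE alone, gets its own table.
CREATE TABLE part_warehouse (
    part      VARCHAR(20),
    warehouse VARCHAR(20),
    quantity  INT,
    PRIMARY KEY (part, warehouse)
);

CREATE TABLE warehouse (
    warehouse         VARCHAR(20) PRIMARY KEY,
    warehouse_address VARCHAR(100)
);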
3.2 Third Normal Form
Third normal form is violated when a non-key field is a fact about another non-key
field, as in
------------------------------------
| EMPLOYEE | DEPARTMENT | LOCATION |
============------------------------
The EMPLOYEE field is the key. If each department is located in one place, then
the LOCATION field is a fact about the DEPARTMENT -- in addition to being a
fact about the EMPLOYEE. The problems with this design are the same as those
caused by violations of second normal form:
· The department's location is repeated in the record of every employee
assigned to that department.
· If the location of the department changes, every such record must be
updated.
· Because of the redundancy, the data might become inconsistent, with
different records showing different locations for the same department.
· If a department has no employees, there may be no record in which to keep
the department's location.
To satisfy third normal form, the record shown above should be decomposed into
the two records:
------------------------- -------------------------
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |
============------------- ==============-----------
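Again as a hedged DDL sketch of this decomposition (the column types are assumptions):

-- Third normal form: LOCATION, a fact about DEPARTMENT rather than about
-- the key EMPLOYEE, moves into a department table.
CREATE TABLE employee (
    employee   VARCHAR(30) PRIMARY KEY,
    department VARCHAR(20)
);

CREATE TABLE department (
    department VARCHAR(20) PRIMARY KEY,
    location   VARCHAR(50)
);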
2. Explain the concept of a Query. How does a Query Optimizer work?
Ans:
A database query can be either a select query or an action query. A select
query is simply a data retrieval query. An action query can ask for additional
operations on the data, such as insertion, updating, or deletion.
The aim of query processing is to find information in one or more databases
and deliver it to the user quickly and efficiently. Traditional techniques work well
for databases with standard, single-site relational structures, but databases
containing more complex and diverse types of data demand new query processing
and optimization techniques. Most real-world data is not well structured. Today’s
databases typically contain much non-structured data such as text, images, video,
and audio, often distributed across computer networks. In this complex milieu
(typified by the World Wide Web), efficient and accurate query processing
becomes quite challenging. Principles of Database Query Processing for Advanced
Applications teaches the basic concepts and techniques of query processing and
optimization for a variety of data forms and database systems, whether structured
or unstructured.
Query Optimizer
The Query Optimizer is the component of a database management system that
attempts to determine the most efficient way to execute a query. The optimizer
considers the possible query plans (discussed below) for a given input query, and
attempts to determine which of those plans will be the most efficient. Cost-based
query optimizers assign an estimated "cost" to each possible query plan, and
choose the plan with the least cost. Costs are used to estimate the runtime cost of
evaluating the query, in terms of the number of I/O operations required, the CPU
requirements, and other factors.
Query plan
A Query Plan (or Query Execution Plan) is a set of steps used to access
information in a SQL relational database management system. This is a specific
case of the relational model concept of access plans. Since SQL is declarative,
there are typically a large number of alternative ways to execute a given query,
with widely varying performance. When a query is submitted to the database, the
query optimizer evaluates some of the different, correct possible plans for
executing the query and returns what it considers the best alternative. Because
query optimizers are imperfect, database users and administrators sometimes need
to manually examine and tune the plans produced by the optimizer to get better
performance.
The set of query plans examined is formed by examining the possible access paths
(e.g. index scan, sequential scan) and join algorithms (e.g. sort-merge join, hash
join, nested loops). The search space can become quite large depending on the
complexity of the SQL query.
The query optimizer cannot be accessed directly by users. Instead, once queries are
submitted to the database server and parsed by the parser, they are passed to the
query optimizer, where optimization occurs.
How the Query Optimizer Works
The MySQL query optimizer has several goals, but its primary aims are to use
indexes whenever possible and to use the most restrictive index in order to
eliminate as many rows as possible as soon as possible. That last part might sound
backward and nonintuitive. After all, your goal in issuing a SELECT statement is
to find rows, not to reject them. The reason the optimizer tries to reject rows is that
the faster it can eliminate rows from consideration, the more quickly the rows that
do match your criteria can be found. Queries can be processed more quickly if the
most restrictive tests can be done first. Suppose that you have a query that tests two
columns, each of which has an index on it:
SELECT col3 FROM mytable
WHERE col1 = 'some value' AND col2 = 'some other value';
Suppose also that the test on col1 matches 900 rows, the test on col2 matches 300
rows, and that both tests together succeed on 30 rows. Testing col1 first results in
900 rows that must be examined to find the 30 that also match the col2 value.
That's 870 failed tests. Testing col2 first results in 300 rows that must be examined
to find the 30 that also match the col1 value. That's only 270 failed tests, so less
computation and disk I/O is required. As a result, the optimizer will test col2 first
because doing so results in less work overall.
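One way to see which index the optimizer actually chose is MySQL's EXPLAIN statement. A minimal sketch, reusing the hypothetical mytable from the example (the index names are assumptions):

CREATE INDEX idx_col1 ON mytable (col1);
CREATE INDEX idx_col2 ON mytable (col2);

EXPLAIN SELECT col3 FROM mytable
WHERE col1 = 'some value' AND col2 = 'some other value';
-- The key column of the EXPLAIN output names the chosen index; given the
-- row counts above, we would expect idx_col2, the more restrictive one.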
You can help the optimizer take advantage of indexes by using the following
guidelines:
Try to compare columns that have the same data type. When you use indexed
columns in comparisons, use columns that are of the same type. Identical data
types will give you better performance than dissimilar types. For example, INT is
different from BIGINT. CHAR(10) is considered the same
as CHAR(10) or VARCHAR(10) but different
from CHAR(12) or VARCHAR(12). If the columns you're comparing have
different types, you can use ALTER TABLE to modify one of them so that the
types match.
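For instance (t1 and t2 are hypothetical tables with a mismatched join column):

-- Suppose t1.customer_id is INT but t2.customer_id is BIGINT; aligning the
-- types helps the optimizer use an index when the columns are compared.
ALTER TABLE t2 MODIFY customer_id INT;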
Try to make indexed columns stand alone in comparison expressions. If you
use a column in a function call or as part of a more complex term in an arithmetic
expression, MySQL can't use the index because it must compute the value of the
expression for every row. Sometimes this is unavoidable, but many times you can
rewrite a query to get the indexed column to appear by itself.
The following WHERE clauses illustrate how this works. They are equivalent
arithmetically, but quite different for optimization purposes:
WHERE mycol < 4 / 2
WHERE mycol * 2 < 4
For the first line, the optimizer simplifies the expression 4/2 to the value 2, and
then uses an index on mycol to quickly find values less than 2. For the second
expression, MySQL must retrieve the value of mycol for each row, multiply by 2,
and then compare the result to 4. In this case, no index can be used. Each value in
the column must be retrieved so that the expression on the left side of the
comparison can be evaluated.
Let's consider another example. Suppose that you have an indexed
column date_col. If you issue a query such as the one following, the index isn't
used:
SELECT * FROM mytbl WHERE YEAR(date_col) < 1990;
The expression doesn't compare 1990 to an indexed column; it compares 1990 to a
value calculated from the column, and that value must be computed for each row.
As a result, the index on date_col is not used because performing the query
requires a full table scan. What's the fix? Just use a literal date, and then the index
on date_col can be used to find matching values in the columns:
WHERE date_col < '1990-01-01'
But suppose that you don't have a specific date. You might be interested instead in
finding records that have a date that lies within a certain number of days from
today. There are several ways to express a comparison of this type—not all of
which are equally efficient. Here are three possibilities:
WHERE TO_DAYS(date_col) - TO_DAYS(CURDATE()) < cutoff
WHERE TO_DAYS(date_col) < cutoff + TO_DAYS(CURDATE())
WHERE date_col < DATE_ADD(CURDATE(), INTERVAL cutoff DAY)
For the first line, no index is used because the column must be retrieved for each
row so that the value of TO_DAYS(date_col) can be computed. The second line is
better. Both cutoff and TO_DAYS(CURDATE()) are constants, so the right-hand
side of the comparison can be calculated by the optimizer once before processing
the query, rather than once per row. But the date_col column still appears in a
function call, preventing use of the index. The third line is best of all. Again, the
right-hand side of the comparison can be computed once as a constant before
executing the query, but now the value is a date. That value can be compared
directly to date_col values, which no longer need to be converted to days. In this
case, the index can be used.
Don't use wildcards at the beginning of a LIKE pattern. Some string searches
use a WHERE clause of the following form:
WHERE col_name LIKE '%string%'
That's the correct thing to do if you want to find a string no matter where it occurs
in the column. But don't put '%' on both sides of the string simply out of habit. If
you're really looking for the string only when it occurs at the beginning of the
column, leave out the first '%'. Suppose that you're looking in a column containing
last names for names like MacGregor or MacDougall that begin with 'Mac'. In that
case, write the WHERE clause like this:
WHERE last_name LIKE 'Mac%'
The optimizer looks at the literal initial part of the pattern and uses the index to
find rows that match as though you'd written the following expression, which is in
a form that allows an index on last_name to be used:
WHERE last_name >= 'Mac' AND last_name < 'Mad'
This optimization does not apply to pattern matches that use
the REGEXP operator. REGEXP expressions are never optimized.
3. Explain the following with respect to Heuristics of Query
Optimization:
A) Equivalence of Expressions
B) Selection Operation
C) Projection Operation
D) Natural Join Operation
Ans:
Equivalent expressions
We often want to replace a complicated expression with a simpler one that means
the same thing. For example, the expression x + 4 + 2 obviously means the same
thing as x + 6, since 4 + 2 = 6. More interestingly, the expression x + x + 4 means
the same thing as 2x + 4, because 2x is x + x when you think of multiplication as
repeated addition. (Which of these is simpler depends on your point of view, but
usually 2x + 4 is more convenient in Algebra.)
Two algebraic expressions are equivalent if they always lead to the same result
when you evaluate them, no matter what values you substitute for the variables.
For example, if you substitute x := 3 in x + x + 4, then you get 3 + 3 + 4, which
works out to 10; and if you substitute it in 2x + 4, then you get 2(3) + 4, which also
works out to 10. There's nothing special about 3 here; the same thing would happen
no matter what value we used, so x + x + 4 is equivalent to 2x + 4. (That's really
what I meant when I said that they mean the same thing.)
When I say that you get the same result, this includes the possibility that the result
is undefined. For example, 1/x + 1/x is equivalent to 2/x; even when you substitute
x := 0, they both come out the same (in this case, undefined). In contrast, x²/x is not
equivalent to x; they usually come out the same, but they are different when x := 0.
(Then x²/x is undefined, but x is 0.) To deal with this situation, there is a sort of
trick you can play, forcing the second expression to be undefined in certain cases.
Just add the words ‘for x ≠ 0’ at the end of the expression to make a new
expression; then the new expression is undefined unless x ≠ 0. (You can put any
other condition you like in place of x ≠ 0, whatever is appropriate in a given
situation.) So x²/x is equivalent to x for x ≠ 0.
To symbolise equivalent expressions, people often simply use an equals sign. For
example, they might say ‘x + x + 4 = 2x + 4’. The idea is that this is a statement
that is always true, no matter what x is. However, it isn't really correct to write
‘1/x + 1/x = 2/x’ to indicate an equivalence of expressions, because this statement
is not correct when x := 0. So instead, I will use the symbol ‘≡’, which you can
read ‘is equivalent to’ (instead of ‘is equal to’ for ‘=’). So I'll say, for example,
x + x + 4 ≡ 2x + 4,
1/x + 1/x ≡ 2/x, and
x²/x ≡ x for x ≠ 0.
The textbook, however, just uses ‘=’ for everything, so you can too, if you want.
Selection Operation
1. Consider the query to find the assets and branch-names of all branches that
have depositors living in Port Chester. In relational algebra, this is

   π_{BNAME, ASSETS} (σ_{CCITY = "Port Chester"} (CUSTOMER ⋈ DEPOSIT ⋈ BRANCH))

   o This expression constructs a huge relation,
     CUSTOMER ⋈ DEPOSIT ⋈ BRANCH,
     of which we are only interested in a few tuples.
   o We also are only interested in two attributes of this relation.
   o We can see that we only want tuples for which CCITY = "PORT
     CHESTER".
   o Thus we can rewrite our query as:

     π_{BNAME, ASSETS} ((σ_{CCITY = "Port Chester"} (CUSTOMER)) ⋈ DEPOSIT ⋈ BRANCH)

   o This should considerably reduce the size of the intermediate relation.
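The same heuristic can be seen in SQL. A sketch, assuming a schema in which CUSTOMER(CNAME, CCITY), DEPOSIT(BNAME, ACCOUNT#, CNAME, BALANCE), and BRANCH(BNAME, ASSETS) join on CNAME and BNAME:

-- As written, the filter logically applies to the full three-way join:
SELECT b.bname, b.assets
FROM customer c
JOIN deposit d ON d.cname = c.cname
JOIN branch  b ON b.bname = d.bname
WHERE c.ccity = 'Port Chester';

-- A query optimizer effectively pushes the selection down, as if the
-- query had been written with the filter applied before the joins:
SELECT b.bname, b.assets
FROM (SELECT cname FROM customer WHERE ccity = 'Port Chester') c
JOIN deposit d ON d.cname = c.cname
JOIN branch  b ON b.bname = d.bname;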
Projection Operation
1. Like selection, projection reduces the size of relations.
   It is advantageous to apply projections early. Consider this form of our
   example query:

   π_{BNAME, ASSETS} ((σ_{CCITY = "Port Chester"} (CUSTOMER)) ⋈ DEPOSIT ⋈ BRANCH)

2. When we compute the subexpression

   (σ_{CCITY = "Port Chester"} (CUSTOMER)) ⋈ DEPOSIT

   we obtain a relation whose scheme is
   (CNAME, CCITY, BNAME, ACCOUNT#, BALANCE)
3. We can eliminate several attributes from this scheme. The only ones we
   need to retain are those that
   o appear in the result of the query or
   o are needed to process subsequent operations.
4. By eliminating unneeded attributes, we reduce the number of columns of the
   intermediate result, and thus its size.
5. In our example, the only attribute we need is BNAME (to join with
   BRANCH). So we can rewrite our expression as:

   π_{BNAME, ASSETS} ((π_{BNAME} ((σ_{CCITY = "Port Chester"} (CUSTOMER)) ⋈ DEPOSIT)) ⋈ BRANCH)

6. Note that there is no advantage in doing an early projection on a relation
   before it is needed for some other operation:
   o We would access every block for the relation to remove attributes.
   o Then we access every block of the reduced-size relation when it is
     actually needed.
   o We do more work in total, rather than less!
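In SQL terms, early projection corresponds to keeping only the columns a later step needs. A sketch under the same assumed schema:

-- The derived table retains only bname, the one attribute needed for the
-- subsequent join with branch, shrinking the intermediate result.
SELECT b.bname, b.assets
FROM (SELECT d.bname
      FROM customer c
      JOIN deposit d ON d.cname = c.cname
      WHERE c.ccity = 'Port Chester') pc
JOIN branch b ON b.bname = pc.bname;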
Natural Join Operation
1. Another way to reduce the size of temporary results is to choose an optimal
   ordering of the join operations.
2. Natural join is associative:

   (r1 ⋈ r2) ⋈ r3 ≡ r1 ⋈ (r2 ⋈ r3)

3. Although these expressions are equivalent, the costs of computing them may
   differ.
   o Look again at our expression

     π_{BNAME, ASSETS} ((σ_{CCITY = "Port Chester"} (CUSTOMER)) ⋈ DEPOSIT ⋈ BRANCH)

   o We see that we can compute DEPOSIT ⋈ BRANCH first and then join
     the result with the first part.
   o However, DEPOSIT ⋈ BRANCH is likely to be a large relation as it
     contains one tuple for every account.
   o The other part,

     σ_{CCITY = "Port Chester"} (CUSTOMER),

     is probably a small relation (comparatively).
   o So, if we compute

     (σ_{CCITY = "Port Chester"} (CUSTOMER)) ⋈ DEPOSIT

     first, we get a reasonably small relation.
   o It has one tuple for each account held by a resident of Port Chester.
   o This temporary relation is much smaller than DEPOSIT ⋈ BRANCH.
4. Natural join is commutative:

   r1 ⋈ r2 ≡ r2 ⋈ r1

   o Thus we could rewrite our relational algebra expression as:

     π_{BNAME, ASSETS} (((σ_{CCITY = "Port Chester"} (CUSTOMER)) ⋈ BRANCH) ⋈ DEPOSIT)

   o But there are no common attributes between CUSTOMER and
     BRANCH, so this is a Cartesian product.
   o Lots of tuples!
   o If a user entered this expression, we would want to use the
     associativity and commutativity of natural join to transform it into
     the more efficient expression we derived earlier (join with
     DEPOSIT first, then with BRANCH).
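In MySQL the chosen join order can be overridden for experimentation with the STRAIGHT_JOIN modifier, which joins tables in the order listed. A sketch under the same assumed schema:

-- Forces CUSTOMER (filtered first) to join DEPOSIT before BRANCH,
-- mirroring the efficient ordering derived above; normally the
-- optimizer chooses the order itself.
SELECT STRAIGHT_JOIN b.bname, b.assets
FROM customer c
JOIN deposit d ON d.cname = c.cname
JOIN branch  b ON b.bname = d.bname
WHERE c.ccity = 'Port Chester';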
Que 4. There are a number of historical, organizational, and technological
reasons that explain the lack of an all-encompassing data management system.
Discuss a few of them with appropriate examples.
Ans:
A number of historical, organizational, and technological reasons explain the lack of an all-encompassing data management system. Among these are:
· The sensible advice – to build small systems with the plan to extend their scope in later implementation phases – allows a core system to be implemented relatively quickly, but has led to a proliferation of relatively small systems.
· Department autonomy has led to construction of department-specific rather than organization-wide systems, again leading to many small, overlapping, and often incompatible systems within an organization.
· The continual evolution of the organization and its interactions, both internal and with its external environment, prohibits complete understanding of future information requirements.
· Parallel development of data management systems for particular applications has led to different and incompatible systems for management of tabular/administrative data, text/document data, historical/statistical data, spatial/geographic data, and streamed/audio and visual data.
The result is that only a portion of an organization’s data is administered by any one data management system and most organizations have a multitude of special purpose databases, managed by different, and often incompatible, data management system types. The growing need to retrieve data from multiple databases within an organization, as well as the rapid dissemination of data through the Internet, has given rise to the requirement of providing integrated access to both internal and external data of multiple types.
A major challenge and critical practical and research problem for the information, computer, and communication technology communities is to develop data management systems that can provide efficient access to the data stored in multiple private and public databases. Problems to be resolved include:
1. Interoperability among systems
2. Incorporation of legacy systems and
3. Integration of management techniques for structured and unstructured data
Each of the above problems entails an integration of concepts, methods, techniques, and tools from separate research and development communities that have existed in parallel but independently and have had rather minimal interaction. One consequence is that overlapping and conflicting terminology exists between these communities. With this definition, no limitations are given as to the type of:
· Data in the collection,
· Model used to structure the collection, or
· Architecture and geographic location of the database
The focus of this text is on on-line – electronic and web-accessible – databases containing multiple media data, thus restricting our interest to multimedia databases stored on one or more computers (DB servers) and accessible from the Internet. Electronic databases are important since they contain data recording the products and services, as well as the economic history and current status, of the owner organization. They are also a source of information for the organization's employees and customers/users. However, databases cannot be used effectively
unless there exist efficient and secure data management systems (DMS) for the data in the databases.
Que.5. Describe the Structural Semantic Data Model (SSM) with relevant
examples.
ANS.
Modelling Complex and Multimedia Data
Data modelling addresses a need in information system analysis and design to
develop a model of the information requirements as well as a set of viable database
structure proposals. The data modelling process consists of:
1. Identifying and describing the information requirements for an information
system,
2. Specifying the data to be maintained by the data management system, and
3. Specifying the data structures to be used for data storage that best support
the information requirements.
A fundamental tool used in this process is the data model, which is used both for
specification of the information requirements at the user level and for specification
of the data structure for the database. During implementation of a database, the
data model guides construction of the schema or data catalog which contains the
metadata that describe the DB structure and data semantics that are used to support
database implementation and data retrieval.
Data modelling, using a specific data model type, and as a unique activity during
information system design, is commonly attributed to Charles Bachman (1969)
who presented the Data Structure Diagram as one of the first, widely used data
models for network database design. Several alternative data model types were
proposed shortly thereafter, the best known of which are the:
· Relational model (Codd, 1970) and the
· Entity-relationship (ER) model (Chen, 1976).
The relational model was quickly criticized for being 'flat' in the sense that all
information is represented as a set of tables with atomic cell values. The definition
of well-formed relational models requires that complex attribute types (hierarchic,
composite, multi-valued, and derived) be converted to atomic attributes and that
relations be normalized. Inter-entity (inter-relation) relationships are difficult to
visualize in the resulting set of relations, making control of the completeness and
correctness of the model difficult. The relational model maps easily to the physical
characteristics of electronic storage media, and as such, is a good tool for design of
the physical database.
The entity-relationship approach to modelling, proposed by Chen (1976), had two
primary objectives: first to visualize inter-entity relationships and second to
separate the DB design process into two phases:
1. Record, in an ER model, the entities and inter-entity relationships required
"by the enterprise", i.e. by the owner/user of the information system or
application. This phase and its resulting model should be independent of the
DBMS tool that is to be used for realizing the DB.
2. Translate the ER model to the data model supported by the DBMS to be
used for implementation.
This two-phase design supports modification at the physical level without
requiring changes to the enterprise or user view of the DB content.
Chen's ER model also quickly came under criticism, particularly for its lack of
ability to model classification structures. In 1977, Smith & Smith presented a
method for modelling generalization and aggregation hierarchies that underlie the
many extended/enhanced entity-relationship, EER, model types proposed and in
use today.
Que.6. What are the differences between Global and Local Transactions in a
distributed database system? What are the roles of the Transaction Manager
and Transaction Coordinator in managing transactions in a distributed
database?
Ans:
Differences between Global and Local Transaction
A local transaction is one that accesses data only at the site where the transaction
originated. A global transaction, on the other hand, is one that either accesses data
at a site different from the one at which the transaction was initiated, or accesses
data at several different sites.
Roles of Transaction Manager and Transaction Coordinator in managing transactions
in a distributed database
1. The Transaction Manager manages the execution of those transactions (or
sub-transactions) that access data stored in a local site. Note that each such
transaction may be either a local transaction (that is, a transaction that executes
only at that site) or part of a global transaction (that is, a transaction that
executes at several sites).
The Transaction manager is responsible for,
1. Maintaining a log for recovery purposes
2. Participating in an appropriate concurrency-control scheme to coordinate the
concurrent execution of the transactions executing at that site.
2. The Transaction Coordinator coordinates the execution of the various
transactions (both local and global) initiated at that site.
The coordinator is responsible for,
1. Starting the execution of the transaction
2. Breaking the transaction into a number of sub transactions and distributing
these sub transactions to the appropriate sites for execution.
3. Coordinating the termination of the transaction, which may result in the
transaction being committed at all sites or aborted at all sites.
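As one concrete illustration, MySQL exposes coordinator-driven distributed transactions through its XA statements. A sketch of the sequence a coordinator would drive at each participating site (the transaction id, table, and column names are assumptions); the coordinator issues XA COMMIT everywhere only after every site has prepared successfully, otherwise XA ROLLBACK everywhere:

XA START 'txn42';
UPDATE account SET balance = balance - 100 WHERE acct_no = 'A-101';
XA END 'txn42';
XA PREPARE 'txn42';
-- After collecting PREPARE outcomes from all sites, the coordinator ends
-- the transaction the same way at every site:
XA COMMIT 'txn42';
-- ... or, had any site failed to prepare: XA ROLLBACK 'txn42';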