normalization

Relational Database Design: Normalization

Prepared by Vaishali Kalaria

Design Guidelines for Relational Databases

What is relational database design?

The grouping of attributes to form "good" relation schemas

Two levels of relation schemas

The logical "user view" level

The storage "base relation" level

Design is concerned mainly with base relations

What are the criteria for "good" base relations?

1. Semantics of the Relation Attributes

Each tuple in a relation should represent one entity or relationship instance. (This applies to individual relations and their attributes.)

Attributes of different entities should not be mixed in the same relation

Only foreign keys should be used to refer to other entities

Entity and relationship attributes should be kept apart as much as possible.

2. Redundancy and Data Anomalies

Redundant data is the same ‘information’ stored more than once, i.e., data that could be removed without any loss of information.

Wastes storage

Causes problems with update anomalies:

Insertion anomalies

Deletion anomalies

Modification anomalies

Design a schema that does not suffer from the insertion, deletion and update anomalies.

Example: the following relation contains staff and department details:

staffNo  job       dept  dname       city
SL10     Salesman  10    Sales       Stratford
SA51     Manager   20    Accounts    Barking
DS40     Clerk     20    Accounts    Barking
OS45     Clerk     30    Operations  Barking

Such ‘redundancy’ could lead to the following ‘anomalies’.

Insert Anomaly: Occurs when we need to store a value for an attribute but cannot, because the value for another attribute is unknown.
• We can’t insert a dept without inserting a member of staff that works in that department.

Update Anomaly: Occurs when a change of a single attribute in one record requires changes in multiple records.
• We could change the name of the dept that SA51 works in without simultaneously changing it for DS40, who works in the same dept, leaving the data inconsistent.

Deletion Anomaly: Occurs when the removal of a record results in a loss of important information about an entity.
• By removing employee SL10 we have removed all information pertaining to the Sales dept.
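The deletion and update anomalies can be sketched in a few lines of Python using the staff/department rows from the example. This is only an illustration under assumptions: the list-of-dicts representation, the helper names, and the new department name "Finance" are made up for the sketch and do not come from the slides.

```python
# Staff/department rows from the example relation (one dict per tuple).
staff_dept = [
    {"staffNo": "SL10", "job": "Salesman", "dept": 10, "dname": "Sales", "city": "Stratford"},
    {"staffNo": "SA51", "job": "Manager", "dept": 20, "dname": "Accounts", "city": "Barking"},
    {"staffNo": "DS40", "job": "Clerk", "dept": 20, "dname": "Accounts", "city": "Barking"},
    {"staffNo": "OS45", "job": "Clerk", "dept": 30, "dname": "Operations", "city": "Barking"},
]


def known_departments(rows):
    """Return the set of department names that can still be found in the rows."""
    return {row["dname"] for row in rows}


def names_per_dept(rows):
    """Map each dept number to the set of names stored for it."""
    names = {}
    for row in rows:
        names.setdefault(row["dept"], set()).add(row["dname"])
    return names


# Deletion anomaly: removing SL10 also removes the only record of the Sales dept.
after_delete = [row for row in staff_dept if row["staffNo"] != "SL10"]
sales_lost = "Sales" not in known_departments(after_delete)

# Update anomaly: renaming dept 20 in only one of its rows leaves two names for it.
after_update = [dict(row) for row in staff_dept]
after_update[1]["dname"] = "Finance"  # SA51 only; "Finance" is a hypothetical new name
inconsistent = {d for d, n in names_per_dept(after_update).items() if len(n) > 1}

print(sales_lost, inconsistent)
```

Storing department details in a separate relation keyed on dept would make both problems impossible, which is what the later normal forms achieve.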

3. Null Values in Tuples

Relations should be designed such that their tuples will have as few NULL values as possible

Attributes that are NULL frequently could be placed in separate relations (with the primary key)

Reasons for nulls:

Attribute not applicable or invalid

Attribute value unknown (may exist)

Value known to exist, but unavailable

Purpose of Normalization

To avoid redundancy by storing each ‘fact’ within the database only once.

To put data into a form that conforms to relational principles - no repeating groups.

To put the data into a form that is more able to accurately accommodate change.

To avoid certain updating ‘anomalies’.

To facilitate the enforcement of data constraints.

Normalization

"Normalization" refers to the process of creating an efficient, reliable, flexible, and appropriate "relational" structure for storing information. Normalized data must be in a "relational" data structure.

Usually involves dividing a database into two or more tables and defining relationships between the tables.

The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships

The Process of Normalization

• Normalization is often executed as a series of steps.
• Each step corresponds to a specific normal form that has known properties.
• As normalization proceeds, the relations become progressively
  • more restricted in format, and
  • less vulnerable to update anomalies.

Unnormalized form (UNF)

First normal form (1NF): remove repeating groups

Second normal form (2NF): remove partial dependencies

Third normal form (3NF): remove transitive dependencies

Boyce-Codd normal form (BCNF): remove remaining functional dependency anomalies

Fourth normal form (4NF): remove multivalued dependencies

Fifth normal form (5NF): remove remaining anomalies

Stages of Normalisation

Unnormalized Form (UNF)

Definition: A relation is unnormalized when it has not had any normalization rules applied to it, and it suffers from various anomalies.

The starting point is the capturing of attributes into a ‘Universal Relation’ from a screen layout, manual report, manual document, etc.

ClientRental relation in UNF

Unnormalized form (UNF): A table that contains one or more repeating groups.

ClientNo  cName          propertyNo  pAddress                rentStart  rentFinish  rent  ownerNo  oName
CR76      John Kay       PG4         6 Lawrence St, Glasgow  1-Jul-00   31-Aug-01   350   CO40     Tina Murphy
                         PG16        5 Novar Dr, Glasgow     1-Sep-02   1-Sep-02    450   CO93     Tony Shaw
CR56      Aline Stewart  PG4         6 Lawrence St, Glasgow  1-Sep-99   10-Jun-00   350   CO40     Tina Murphy
                         PG36        2 Manor Rd, Glasgow     10-Oct-00  1-Dec-01    370   CO93     Tony Shaw
                         PG16        5 Novar Dr, Glasgow     1-Nov-02   1-Aug-03    450   CO93     Tony Shaw

Figure ClientRental unnormalized table

Repeating group = (propertyNo, pAddress, rentStart, rentFinish, rent, ownerNo, oName)

First Normal Form (1NF)

Definition: A relation is in 1NF if, and only if, all its underlying attributes contain atomic values only; that is, the intersection of each row and column contains one and only one value.

Remove repeating groups into a new relation.

1NF disallows having a set of values, a tuple of values, or a combination of both as an attribute value for a single tuple.

1NF

There are two approaches to removing repeating groups from unnormalized tables:

1. Remove the repeating groups by entering appropriate data in the empty columns of rows containing the repeating data.

2. Remove the repeating group by placing the repeating data, along with a copy of the original key attribute(s), in a separate relation. A primary key is identified for the new relation.
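The two approaches can be sketched in Python on a cut-down version of the ClientRental data. This is an illustration only: the nested `rentals` list, the function names `flatten` and `split`, and the choice to show just four attributes are assumptions made for brevity.

```python
# Unnormalized ClientRental: each client carries a repeating group of rentals.
# Only a few attributes are shown to keep the sketch short.
unf = [
    {"clientNo": "CR76", "cName": "John Kay",
     "rentals": [{"propertyNo": "PG4", "rent": 350},
                 {"propertyNo": "PG16", "rent": 450}]},
    {"clientNo": "CR56", "cName": "Aline Stewart",
     "rentals": [{"propertyNo": "PG4", "rent": 350},
                 {"propertyNo": "PG36", "rent": 370},
                 {"propertyNo": "PG16", "rent": 450}]},
]


def flatten(rows):
    """Approach 1: repeat the client data on every rental row."""
    return [
        {"clientNo": c["clientNo"], "cName": c["cName"], **r}
        for c in rows for r in c["rentals"]
    ]


def split(rows):
    """Approach 2: keep clients and rentals in two relations linked by clientNo."""
    clients = [{"clientNo": c["clientNo"], "cName": c["cName"]} for c in rows]
    rentals = [{"clientNo": c["clientNo"], **r} for c in rows for r in c["rentals"]]
    return clients, rentals


flat = flatten(unf)
clients, rentals = split(unf)
print(len(flat), len(clients), len(rentals))
```

Approach 1 yields one wide relation with redundant client names; approach 2 yields two relations in which the client name is stored once per client.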

1NF ClientRental relation with the first approach

ClientNo  propertyNo  cName          pAddress                rentStart  rentFinish  rent  ownerNo  oName
CR76      PG4         John Kay       6 Lawrence St, Glasgow  1-Jul-00   31-Aug-01   350   CO40     Tina Murphy
CR76      PG16        John Kay       5 Novar Dr, Glasgow     1-Sep-02   1-Sep-02    450   CO93     Tony Shaw
CR56      PG4         Aline Stewart  6 Lawrence St, Glasgow  1-Sep-99   10-Jun-00   350   CO40     Tina Murphy
CR56      PG36        Aline Stewart  2 Manor Rd, Glasgow     10-Oct-00  1-Dec-01    370   CO93     Tony Shaw
CR56      PG16        Aline Stewart  5 Novar Dr, Glasgow     1-Nov-02   1-Aug-03    450   CO93     Tony Shaw

Figure 1NF ClientRental relation with the first approach

The ClientRental relation is defined as follows:

ClientRental (clientNo, propertyNo, cName, pAddress, rentStart, rentFinish, rent, ownerNo, oName)

With the first approach, we remove the repeating group (property rented details) by entering the appropriate client data into each row.

1NF ClientRental relation with the second approach

With the second approach, we remove the repeating group (property rented details) by placing the repeating data along with a copy of the original key attribute (clientNo) in a separate relation.

Client (clientNo, cName)
PropertyRentalOwner (clientNo, propertyNo, pAddress, rentStart, rentFinish, rent, ownerNo, oName)

Client

ClientNo  cName
CR76      John Kay
CR56      Aline Stewart

PropertyRentalOwner

ClientNo  propertyNo  pAddress                rentStart  rentFinish  rent  ownerNo  oName
CR76      PG4         6 Lawrence St, Glasgow  1-Jul-00   31-Aug-01   350   CO40     Tina Murphy
CR76      PG16        5 Novar Dr, Glasgow     1-Sep-02   1-Sep-02    450   CO93     Tony Shaw
CR56      PG4         6 Lawrence St, Glasgow  1-Sep-99   10-Jun-00   350   CO40     Tina Murphy
CR56      PG36        2 Manor Rd, Glasgow     10-Oct-00  1-Dec-01    370   CO93     Tony Shaw
CR56      PG16        5 Novar Dr, Glasgow     1-Nov-02   1-Aug-03    450   CO93     Tony Shaw

Figure 1NF ClientRental relation with the second approach


Second Normal Form (2NF)

A database table is said to be in 2NF if it is in 1NF and every non-key field/column is fully functionally dependent on the primary key.

Converting to 2NF removes the partial dependencies of any non-key field on part of the primary key.

Note: It is still possible for a table in 2NF to exhibit transitive dependency; that is, one or more attributes may be functionally dependent on non-key attributes.

The process of converting the database table into 2NF:

Identify the primary key for the 1NF relation.

Identify the functional dependencies in the relation.

If partial dependencies on the primary key exist, remove them by placing them in a new relation along with a copy of their determinant.

2NF ClientRental relation

The ClientRental relation has the following functional dependencies:

fd1  clientNo, propertyNo -> rentStart, rentFinish  (Primary key)
fd2  clientNo -> cName  (Partial dependency)
fd3  propertyNo -> pAddress, rent, ownerNo, oName  (Partial dependency)
fd4  ownerNo -> oName  (Transitive dependency)
fd5  clientNo, rentStart -> propertyNo, pAddress, rentFinish, rent, ownerNo, oName  (Candidate key)
fd6  propertyNo, rentStart -> clientNo, cName, rentFinish  (Candidate key)
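The search for partial dependencies can be sketched in Python using attribute closure. This is a minimal illustration, not a prescribed algorithm from the slides: the helper names `closure` and `partial_dependencies` are hypothetical, FDs are written as (determinant, dependents) tuples, and only fd1 to fd4 are included for brevity.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return the closure of a set of attributes under a list of FDs (lhs, rhs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def partial_dependencies(key, fds):
    """Map each proper subset of the key to the non-key attributes it determines."""
    found = {}
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            dependents = closure(subset, fds) - set(key)
            if dependents:
                found[subset] = dependents
    return found


# fd1 to fd4 of the ClientRental relation.
fds = [
    (("clientNo", "propertyNo"), ("rentStart", "rentFinish")),
    (("clientNo",), ("cName",)),
    (("propertyNo",), ("pAddress", "rent", "ownerNo", "oName")),
    (("ownerNo",), ("oName",)),
]
partial = partial_dependencies(("clientNo", "propertyNo"), fds)
print(partial)
```

Each entry of the result names a determinant and the attributes that should move with it into a new relation, here Client (for clientNo) and PropertyOwner (for propertyNo).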


After removing the partial dependencies, we create three new relations called Client, Rental, and PropertyOwner:

Client (clientNo, cName)
Rental (clientNo, propertyNo, rentStart, rentFinish)
PropertyOwner (propertyNo, pAddress, rent, ownerNo, oName)

Client

ClientNo  cName
CR76      John Kay
CR56      Aline Stewart

Rental

ClientNo  propertyNo  rentStart  rentFinish
CR76      PG4         1-Jul-00   31-Aug-01
CR76      PG16        1-Sep-02   1-Sep-02
CR56      PG4         1-Sep-99   10-Jun-00
CR56      PG36        10-Oct-00  1-Dec-01
CR56      PG16        1-Nov-02   1-Aug-03

PropertyOwner

propertyNo  pAddress                rent  ownerNo  oName
PG4         6 Lawrence St, Glasgow  350   CO40     Tina Murphy
PG16        5 Novar Dr, Glasgow     450   CO93     Tony Shaw
PG36        2 Manor Rd, Glasgow     370   CO93     Tony Shaw

Figure 2NF ClientRental relation

Third Normal Form (3NF)

Transitive dependency: A condition where A, B, and C are attributes of a relation such that if A -> B and B -> C, then C is transitively dependent on A via B (provided that A is not functionally dependent on B or C).

Third normal form (3NF)

A relation that is in first and second normal form, and in which no non-primary-key attribute is transitively dependent on the primary key.

The normalization of 2NF relations to 3NF involves the removal of transitive dependencies by placing the attribute(s) in a new relation along with a copy of the determinant.


The functional dependencies for the Client, Rental and PropertyOwner relations are as follows:

Client
fd2  clientNo -> cName  (Primary key)

Rental
fd1  clientNo, propertyNo -> rentStart, rentFinish  (Primary key)
fd5  clientNo, rentStart -> propertyNo, rentFinish  (Candidate key)
fd6  propertyNo, rentStart -> clientNo, rentFinish  (Candidate key)

PropertyOwner
fd3  propertyNo -> pAddress, rent, ownerNo, oName  (Primary key)
fd4  ownerNo -> oName  (Transitive dependency)
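Removing the transitive dependency from PropertyOwner can be sketched in Python as follows. This is a simplified illustration under assumptions: the helpers `closure` and `remove_transitive` are hypothetical names, and the sketch treats any FD whose determinant is neither a superkey nor part of the primary key as the source of a transitive dependency.

```python
def closure(attrs, fds):
    """Return the closure of a set of attributes under a list of FDs (lhs, rhs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def remove_transitive(attrs, key, fds):
    """Move transitively dependent attributes into new relations with their determinant."""
    remaining = set(attrs)
    new_relations = []
    for lhs, rhs in fds:
        lhs, rhs = set(lhs), set(rhs)
        is_superkey = closure(lhs, fds) >= set(attrs)
        if not is_superkey and not lhs <= set(key):
            moved = (rhs & remaining) - set(key) - lhs
            if moved:
                new_relations.append(lhs | moved)  # determinant plus its dependents
                remaining -= moved                 # determinant stays as a foreign key
    return remaining, new_relations


attrs = {"propertyNo", "pAddress", "rent", "ownerNo", "oName"}
fds = [
    (("propertyNo",), ("pAddress", "rent", "ownerNo", "oName")),  # fd3
    (("ownerNo",), ("oName",)),                                   # fd4
]
remaining, new_relations = remove_transitive(attrs, {"propertyNo"}, fds)
print(sorted(remaining), [sorted(r) for r in new_relations])
```

The determinant ownerNo stays behind in PropertyOwner, where it acts as a foreign key to the new Owner relation.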


The resulting 3NF relations have the forms:

Client (clientNo, cName)
Rental (clientNo, propertyNo, rentStart, rentFinish)
PropertyOwner (propertyNo, pAddress, rent, ownerNo)
Owner (ownerNo, oName)

Client

ClientNo  cName
CR76      John Kay
CR56      Aline Stewart

Rental

ClientNo  propertyNo  rentStart  rentFinish
CR76      PG4         1-Jul-00   31-Aug-01
CR76      PG16        1-Sep-02   1-Sep-02
CR56      PG4         1-Sep-99   10-Jun-00
CR56      PG36        10-Oct-00  1-Dec-01
CR56      PG16        1-Nov-02   1-Aug-03

PropertyOwner

propertyNo  pAddress                rent  ownerNo
PG4         6 Lawrence St, Glasgow  350   CO40
PG16        5 Novar Dr, Glasgow     450   CO93
PG36        2 Manor Rd, Glasgow     370   CO93

Owner

ownerNo  oName
CO40     Tina Murphy
CO93     Tony Shaw

Figure 3NF ClientRental relation

Boyce-Codd Normal Form (BCNF)

A relation is in BCNF if, and only if, every determinant is a candidate key.

BCNF is a refinement of third normal form.

A relation schema R is in Boyce-Codd Normal Form (BCNF) if whenever a nontrivial FD X -> A holds in R, then X is a superkey of R.

That is, every relation in BCNF is also in 3NF, but a relation in 3NF is not necessarily in BCNF.

3NF to BCNF

Identify all candidate keys in the relation.

Identify all functional dependencies in the relation.

If functional dependencies exist in the relation whose determinants are not candidate keys for the relation, remove those functional dependencies by placing them in a new relation along with a copy of their determinant.

Example of BCNF

fd1  clientNo, interviewDate -> interviewTime, staffNo, roomNo  (Primary key)
fd2  staffNo, interviewDate, interviewTime -> clientNo  (Candidate key)
fd3  roomNo, interviewDate, interviewTime -> clientNo, staffNo  (Candidate key)
fd4  staffNo, interviewDate -> roomNo  (not a candidate key)

ClientNo interviewDate interviewTime staffNo roomNo

CR76 13-May-02 10.30 SG5 G101

CR75 13-May-02 12.00 SG5 G101

CR74 13-May-02 12.00 SG37 G102

CR56 1-Jul-02 10.30 SG5 G102

Figure ClientInterview relation

ClientInterview
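The BCNF test (every determinant must be a superkey) can be sketched in Python for the ClientInterview relation. This is a minimal illustration: the helper names `closure` and `bcnf_violations` are hypothetical, and FDs are written as (determinant, dependents) tuples taken from fd1 to fd4.

```python
def closure(attrs, fds):
    """Return the closure of a set of attributes under a list of FDs (lhs, rhs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(attrs, fds):
    """Return the nontrivial FDs whose determinant is not a superkey."""
    return [
        (lhs, rhs) for lhs, rhs in fds
        if not set(rhs) <= set(lhs) and not closure(lhs, fds) >= set(attrs)
    ]


attrs = {"clientNo", "interviewDate", "interviewTime", "staffNo", "roomNo"}
fds = [
    (("clientNo", "interviewDate"), ("interviewTime", "staffNo", "roomNo")),  # fd1
    (("staffNo", "interviewDate", "interviewTime"), ("clientNo",)),           # fd2
    (("roomNo", "interviewDate", "interviewTime"), ("clientNo", "staffNo")),  # fd3
    (("staffNo", "interviewDate"), ("roomNo",)),                              # fd4
]
violations = bcnf_violations(attrs, fds)
print(violations)
```

Only fd4 is reported, because the closure of (staffNo, interviewDate) never reaches interviewTime or clientNo; fd1 to fd3 all have superkeys as determinants.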

Example of BCNF(2)

To transform the ClientInterview relation to BCNF, we must remove the violating functional dependency by creating two new relations called Interview and StaffRoom, as shown below:

Interview (clientNo, interviewDate, interviewTime, staffNo)
StaffRoom (staffNo, interviewDate, roomNo)

Interview

ClientNo  interviewDate  interviewTime  staffNo
CR76      13-May-02      10.30          SG5
CR75      13-May-02      12.00          SG5
CR74      13-May-02      12.00          SG37
CR56      1-Jul-02       10.30          SG5

StaffRoom

staffNo  interviewDate  roomNo
SG5      13-May-02      G101
SG37     13-May-02      G102
SG5      1-Jul-02       G102

Figure BCNF Interview and StaffRoom relations

Example

Example - UNF to 1NF Relation

Example - 1NF to 2NF

1NF: Property_Inspection (Property_No, IDate, ITime, PAddress, Comments, Staff_No, Sname, Car_Reg)

Full Functional Dependency: (Property_No, IDate) -> (ITime, Comments, Staff_No, Sname, Car_Reg)

Partial Dependency: Property_No -> PAddress (PAddress depends on only part of the key (Property_No, IDate))

2NF:
Prop (Property_No, PAddress)
Prop_Inspection (Property_No, IDate, ITime, Comments, Staff_No, Sname, Car_Reg)

Example - 2NF to 3NF

Transitive Dependency in Prop_Inspection:
(Property_No, IDate) -> Staff_No
Staff_No -> Sname

3NF:
Staff (Staff_No, Sname)
Prop_Inspection (Property_No, IDate, ITime, Comments, Staff_No, Car_Reg)

Example - 3NF to BCNF

Prop (Property_No, PAddress)
Staff (Staff_No, Sname)
Prop_Inspection (Property_No, IDate, ITime, Comments, Staff_No, Car_Reg)

Prop and Staff are already in BCNF.

FDs of Prop_Inspection:
(Property_No, IDate) -> (ITime, Comments, Staff_No, Car_Reg)
(Staff_No, IDate) -> Car_Reg
(Car_Reg, IDate, ITime) -> (Property_No, Comments, Staff_No)
(Staff_No, IDate, ITime) -> (Property_No, Comments)

The determinant (Staff_No, IDate) is not a candidate key, so Prop_Inspection is not in BCNF.

Example – BCNF

Prop (Property_No,Paddress)

Staff (Staff_No, Sname)

Inspection (Property_No, IDate, ITime, Comments, Staff_No)

Staff_Car (Staff_No, IDate, Car_Reg)

What is Decomposition?

Decomposition: the process of breaking something down into parts or elements.

Decomposition in a database means breaking tables down into multiple tables.

From a database perspective, it means moving to a higher normal form.

The aim is to break relations into smaller ones so that the data model conforms to the normal forms and avoids redundancy.

Decomposition of relation schema

Suppose R is a relation schema R = {A1, A2, A3, …, An}.

It is decomposed into a set of relation schemas D = {R1, R2, R3, …, Rm} such that

Ri ⊆ R for 1 <= i <= m, and R1 ⋃ R2 ⋃ R3 ⋃ … ⋃ Rm = R

Ex: gradeInfo (rollNo, studName, course, grade)
R1: gradeInfo (rollNo, course, grade)
R2: studInfo (rollNo, studName)
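The two conditions on a decomposition, and the projection that produces R1 and R2, can be sketched in Python. This is an illustration only: the function names are hypothetical, and the three sample rows (roll numbers, names, courses, grades) are made up because the slides give no data for gradeInfo.

```python
def is_valid_decomposition(r, parts):
    """Check Ri ⊆ R for every part and R1 ∪ ... ∪ Rm = R."""
    r = set(r)
    return all(set(p) <= r for p in parts) and set().union(*map(set, parts)) == r


def project(rows, attrs):
    """Project a list of dict rows onto attrs, dropping duplicate tuples."""
    seen = []
    for row in rows:
        t = {a: row[a] for a in attrs}
        if t not in seen:
            seen.append(t)
    return seen


R = ("rollNo", "studName", "course", "grade")
R1 = ("rollNo", "course", "grade")
R2 = ("rollNo", "studName")

# Made-up sample rows for illustration.
grade_info = [
    {"rollNo": 1, "studName": "Asha", "course": "DBMS", "grade": "A"},
    {"rollNo": 1, "studName": "Asha", "course": "OS", "grade": "B"},
    {"rollNo": 2, "studName": "Ravi", "course": "DBMS", "grade": "B"},
]

valid = is_valid_decomposition(R, [R1, R2])
grades = project(grade_info, R1)
students = project(grade_info, R2)
print(valid, len(grades), len(students))
```

After projection each student name is stored once in studInfo instead of once per course.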

Process of Decomposition

Decomposition

It is important that decompositions are “good”.

Two Characteristics of Good Decompositions

1) Lossless

2) Preserve dependencies

Problem with Decomposition

Given instances of the decomposed relations, we may not be able to reconstruct the corresponding instance of the original relation: information loss.

Example : Problem with Decomposition

R

Model Name  Price  Category
a11         100    Canon
s20         200    Nikon
a70         150    Canon

R1

Model Name  Category
a11         Canon
s20         Nikon
a70         Canon

R2

Price  Category
100    Canon
200    Nikon
150    Canon

Example : Problem with Decomposition

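R1 ⋈ R2 (natural join of R1 and R2 on Category)

Model Name  Price  Category
a11         100    Canon
a11         150    Canon
s20         200    Nikon
a70         100    Canon
a70         150    Canon

R (original relation)

Model Name  Price  Category
a11         100    Canon
s20         200    Nikon
a70         150    Canon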

Lossy decomposition

In the previous example, additional tuples are obtained along with the original tuples.

Although there are more tuples, this leads to less information, because we can no longer tell which tuples belonged to the original relation.

Due to this loss of information, the decomposition in the previous example is called a lossy decomposition or lossy-join decomposition.
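The spurious tuples of the camera example can be reproduced with a small natural-join sketch in Python. This is an illustration under assumptions: relations are modelled as lists of dicts, the attribute name `model` stands in for "Model Name", and `project` and `natural_join` are hypothetical helper names.

```python
def project(rows, attrs):
    """Project a list of dict rows onto attrs, dropping duplicate tuples."""
    seen = []
    for row in rows:
        t = {a: row[a] for a in attrs}
        if t not in seen:
            seen.append(t)
    return seen


def natural_join(r1, r2):
    """Natural join of two non-empty relations given as lists of dicts."""
    common = set(r1[0]) & set(r2[0])  # shared attribute names
    joined = []
    for a in r1:
        for b in r2:
            if all(a[c] == b[c] for c in common):
                row = {**a, **b}
                if row not in joined:
                    joined.append(row)
    return joined


r = [
    {"model": "a11", "price": 100, "category": "Canon"},
    {"model": "s20", "price": 200, "category": "Nikon"},
    {"model": "a70", "price": 150, "category": "Canon"},
]
r1 = project(r, ["model", "category"])
r2 = project(r, ["price", "category"])

joined = natural_join(r1, r2)
spurious = [row for row in joined if row not in r]
print(len(joined), spurious)
```

The join returns five tuples; the two that are not in R pair a11 with price 150 and a70 with price 100, which is exactly the information that can no longer be told apart from the real rows.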

Lossy decomposition (more example)

Employee Project Branch

Brown Mars L.A.

Green Jupiter San Jose

Green Venus San Jose

Hoskins Saturn San Jose

Hoskins Venus San Jose

T

Functional dependencies:

Employee -> Branch, Project -> Branch

Lossy decomposition

Decomposition of the previous relation

T1

Employee  Branch
Brown     L.A.
Green     San Jose
Hoskins   San Jose

T2

Project  Branch
Mars     L.A.
Jupiter  San Jose
Saturn   San Jose
Venus    San Jose

Lossy decomposition


Brown Mars L.A.





Green Saturn San Jose

Hoskins Jupiter San Jose


Brown Mars L.A.





After Natural Join Original Relation

After Natural Join, we get two extra tuples. Thus, there is loss of information.

What is lossless?

Lossless means functioning without a loss. In other words, retain everything.

Important for databases to have this feature.

Lossless Decomposition Property

R : relation
F : set of functional dependencies on R
X, Y : decomposition of R

The decomposition is lossless if:

X ∩ Y -> X, that is, all attributes common to both X and Y functionally determine ALL the attributes in X

OR X ∩ Y -> Y, that is, all attributes common to both X and Y functionally determine ALL the attributes in Y

In other words, if X ∩ Y forms a superkey of either X or Y, the decomposition of R is a lossless decomposition.
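The lossless test can be sketched in Python with attribute closure. This is a minimal illustration: `closure` and `is_lossless` are hypothetical helper names, the first call uses the Employee/Project/Branch dependencies from the earlier example, and the FDs Name -> Price and Name -> Category in the second case are an assumption (a product name determines its price and category), not something stated in the slides.

```python
def closure(attrs, fds):
    """Return the closure of a set of attributes under a list of FDs (lhs, rhs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_lossless(x, y, fds):
    """True if X ∩ Y functionally determines all of X or all of Y."""
    common_closure = closure(set(x) & set(y), fds)
    return set(x) <= common_closure or set(y) <= common_closure


# Employee -> Branch, Project -> Branch.
emp_fds = [(("Employee",), ("Branch",)), (("Project",), ("Branch",))]
emp_split = is_lossless(("Employee", "Branch"), ("Project", "Branch"), emp_fds)

# Assumed FDs for a software price list: Name -> Price, Name -> Category.
sw_fds = [(("Name",), ("Price",)), (("Name",), ("Category",))]
split_on_name = is_lossless(("Name", "Price"), ("Name", "Category"), sw_fds)
split_on_category = is_lossless(("Category", "Name"), ("Category", "Price"), sw_fds)

print(emp_split, split_on_name, split_on_category)
```

Splitting on Branch or on Category fails the test because the shared attribute determines nothing else, while splitting on Name passes because Name is a key of both parts.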

Why lossless?

Ensures that attributes involved in the natural join (X ∩ Y) are a candidate key for at least one of the two relations.

This ensures we can never get the situation where false tuples are generated, as for any value on the join attributes there will be a unique tuple in one of the relations.

A decomposition is lossless if we can recover the original relation:

Decompose: R(A,B,C) into R1(A,B) and R2(A,C)

Recover: join R1 and R2 to obtain R’(A,B,C)

R’(A,B,C) should be the same as R(A,B,C)

Must ensure R’ = R

Lossless Decomposition

Lossless Decomposition example

• Sometimes the same set of data is reproduced:

• (Word, 100) + (Word, WP) = (Word, 100, WP)
• (Oracle, 1000) + (Oracle, DB) = (Oracle, 1000, DB)
• (Access, 100) + (Access, DB) = (Access, 100, DB)

Name Price Category

Word 100 WP

Oracle 1000 DB

Access 100 DB

Name Price

Word 100

Oracle 1000

Access 100

Name Category

Word WP

Oracle DB

Access DB

Lossy Decomposition

• Sometimes it’s not:

• (Word, WP) + (100, WP) = (Word, 100, WP)
• (Oracle, DB) + (1000, DB) = (Oracle, 1000, DB)
• (Oracle, DB) + (100, DB) = (Oracle, 100, DB)
• (Access, DB) + (1000, DB) = (Access, 1000, DB)
• (Access, DB) + (100, DB) = (Access, 100, DB)

Name Price Category

Word 100 WP

Oracle 1000 DB

Access 100 DB

Category Name

WP Word

DB Oracle

DB Access

Category Price

WP 100

DB 1000

DB 100

What’s wrong?

Ensuring lossless decomposition

R(A1, ..., An, B1, ..., Bm, C1, ..., Cp)

is decomposed into

R1(A1, ..., An, B1, ..., Bm) and R2(A1, ..., An, C1, ..., Cp)

If A1, ..., An -> B1, ..., Bm or A1, ..., An -> C1, ..., Cp

then the decomposition is lossless.

Note: don’t need both

Dependency preservation

Dependency preservation refers to a specific case of lossless decomposition, such that the normalized relvars are independent of each other

Some lossless decompositions do not exhibit dependency preservation

Let R(A,B,C,D) be a relation with a set of dependencies F that includes A ➙ B and A ➙ C. With the decomposition R1(A,B), R2(B,C,D), the dependency A ➙ C cannot be checked using only one relation, because A and C never appear together.

It may not be possible to keep each and every dependency in F within a single relation.

But the dependencies that are preserved should be equivalent to F.

Let F be the set of dependencies of relation R, and let R be decomposed into R1, R2, …, Rn.

Let F1, F2, …, Fn be the sets of dependencies that involve only attributes of R1, R2, …, Rn respectively.

The decomposition preserves dependencies if F1 ⋃ F2 ⋃ … ⋃ Fn implies every dependency in F.

If the decomposition does not preserve the dependencies, then either the decomposed relations may fail to satisfy F, or an update may require a join operation to check F.

Dependency Preserving Decompositions (Contd.)

Decomposition of R into X and Y is dependency preserving if (FX ∪ FY)+ = F+

i.e., if we consider only dependencies in the closure F+ that can be checked in X without considering Y, and in Y without considering X, these imply all dependencies in F+.

It is important to consider F+ in this definition: take ABC with A -> B, B -> C, C -> A, decomposed into AB and BC. Is this dependency preserving? Is C -> A preserved?

Note: F+ contains F ∪ {A -> C, B -> A, C -> B}, so FAB contains A -> B and B -> A; FBC contains B -> C and C -> B. So, (FAB ∪ FBC)+ contains C -> A.
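Dependency preservation can be checked without enumerating F+ by growing the determinant of each FD one decomposed relation at a time. The sketch below is one common way to do this, given here as an illustration rather than as the method of the slides; `closure` and `is_preserved` are hypothetical helper names.

```python
def closure(attrs, fds):
    """Return the closure of a set of attributes under a list of FDs (lhs, rhs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_preserved(fd, parts, fds):
    """True if fd follows from the FDs that can be checked inside single parts."""
    lhs, rhs = fd
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for part in parts:
            part = set(part)
            # Attributes of this part implied by what we already have within it.
            gained = closure(result & part, fds) & part
            if not gained <= result:
                result |= gained
                changed = True
    return set(rhs) <= result


# ABC with A -> B, B -> C, C -> A, decomposed into AB and BC.
f_abc = [(("A",), ("B",)), (("B",), ("C",)), (("C",), ("A",))]
c_to_a = is_preserved((("C",), ("A",)), [("A", "B"), ("B", "C")], f_abc)

# R(A,B,C,D) with A -> B, A -> C, decomposed into R1(A,B) and R2(B,C,D).
f_abcd = [(("A",), ("B",)), (("A",), ("C",))]
a_to_c = is_preserved((("A",), ("C",)), [("A", "B"), ("B", "C", "D")], f_abcd)

print(c_to_a, a_to_c)
```

C -> A is preserved because C -> B can be checked in BC and B -> A in AB; A -> C is not, because no chain of locally checkable dependencies leads from A to C.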

Dependency Preservation

Example: decompose the relation (supplier, city, status), where supplier determines city and status, and city and status determine each other.

Dependencies are preserved in this projection:
SC {S#, CITY}
CS {CITY, STATUS}

Dependencies are not preserved in this one:
SC {S#, CITY}
CS {S#, STATUS}

Although the second is nonloss, you still cannot update the two relations independently.

Dependency Preservation

Ensures we can “easily” check whether an FD X -> Y is violated during an update to a database:

The projection of an FD set F onto a set of attributes Z, written FZ, is

{X -> Y | X -> Y ∈ F+, X ∪ Y ⊆ Z}

i.e., it is those FDs local to Z’s attributes.

A decomposition R1, …, Rk is dependency preserving if F+ = (FR1 ∪ ... ∪ FRk)+

The decomposition hasn’t “lost” any essential FDs, so we can check without doing a join.

Example of Lossless and Dependency-Preserving Decompositions

Given relation scheme R(cno, name, street, city, st, zip, item, price)

And FD set:
cno -> name
name -> street, city
street, city -> st
street, city -> zip
name, item -> price

Consider the decomposition R1(cno, name, street, city, st, zip) and R2(cno, name, item, price). Is it lossless? Is it dependency preserving?

What if we replaced the first FD by name, street -> city?

Comparison of BCNF and 3NF

It is always possible to decompose a relation into a set of relations that are in 3NF such that:
the decomposition is lossless
the dependencies are preserved

It is always possible to decompose a relation into a set of relations that are in BCNF such that:
the decomposition is lossless
but it may not be possible to preserve dependencies.
