CSCI1270: Introduction to Database Systems

CSCI1270 Introduction to Database Systems

Normalization

Another Use for FD’s: Schema Design

Schema Design: Approach #1

1. Construct E/R diagram
2. Translate into tables

Subjective: How do we know if it is any good?

Schema Design: Approach #2

1. Start with universal relation
2. Determine FD’s
3. “Decompose” UR using FD’s as guide

Schema Design: Approach #3

1. Construct E/R diagram to come up with 1st-cut design
2. Use FD’s to verify or refine

Decomposition

1. Decomposing the Schema

R = (bname, bcity, assets, cname, lno, amt)

R1 = (bname, bcity, assets, cname)

R2 = (cname, lno, amt)

Notation: R = R1 ∪ R2

Decomposition

2. Decomposing the Instance

R =

bname   bcity  assets  cname    lno   amt
Dntn    Bkln   9M      Jones    L-17  1000
Dntn    Bkln   9M      Johnson  L-23  2000
Mianus  Hnck   1.7M    Jones    L-93  500
Dntn    Bkln   9M      Hayes    L-17  1000

R1 =

bname   bcity  assets  cname
Dntn    Bkln   9M      Jones
Dntn    Bkln   9M      Johnson
Mianus  Hnck   1.7M    Jones
Dntn    Bkln   9M      Hayes

R2 =

cname    lno   amt
Jones    L-17  1000
Johnson  L-23  2000
Jones    L-93  500
Hayes    L-17  1000

BTW: Not a Good Decomposition

Goals of Decomposition

1. Lossless Joins
   Want to be able to reconstruct big relation by joining smaller ones (natural join) (i.e.: R1 ⋈ R2 = R?)

2. Dependency Preservation
   Want to minimize the cost of global integrity constraints based on FD’s (i.e.: avoid big joins in assertions)

3. Redundancy Avoidance
   Avoid unnecessary data duplication (motivation for decomposition)

Summary:
LJ: Information loss
DP: Efficiency (time)
RA: Efficiency (space), update anomalies

Another Use for FD’s: Schema Design

Example:

Design: R: “Universal Relation”

R =

bname   bcity  assets  cname    lno   amt
Dntn    Bkln   9M      Jones    L-17  1000
Dntn    Bkln   9M      Johnson  L-23  2000
Mianus  Hnck   1.7M    Jones    L-93  500
Dntn    Bkln   9M      Hayes    L-17  1000

Tuple meaning: Jones has a loan (L-17) for $1000 taken out of the Dntn branch in Bkln, which has assets of $9M

pro: Fast queries (no need for joins!)
con: Redundancy, update anomalies, deletion anomalies

Decomposition Goal #1: Lossless Joins

A Bad Decomposition

R1 =

bname   bcity  assets  cname
Dntn    Bkln   9M      Jones
Dntn    Bkln   9M      Johnson
Mianus  Hnck   1.7M    Jones
Dntn    Bkln   9M      Hayes

R2 =

cname    lno   amt
Jones    L-17  1000
Johnson  L-23  2000
Jones    L-93  500
Hayes    L-17  1000

R1 ⋈ R2 =

bname   bcity  assets  cname    lno   amt
Dntn    Bkln   9M      Jones    L-17  1000
Dntn    Bkln   9M      Jones    L-93  500
Dntn    Bkln   9M      Johnson  L-23  2000
Mianus  Hnck   1.7M    Jones    L-17  1000
Mianus  Hnck   1.7M    Jones    L-93  500
Dntn    Bkln   9M      Hayes    L-17  1000

Decomposition Goal #1: Lossless Joins

A Bad Decomposition

Problem: R1 ⋈ R2 adds meaningless tuples, e.g. (Dntn, Bkln, 9M, Jones, L-93, 500) and (Mianus, Hnck, 1.7M, Jones, L-17, 1000)

“Lossy join”: By adding noise, we have lost meaningful information as a result of decomposition

Lossless Joins

Is the Following Decomposition Lossless or Lossy?

R1 =

bname   bcity  assets  cname
Dntn    Bkln   9M      Jones
Dntn    Bkln   9M      Johnson
Mianus  Hnck   1.7M    Jones
Dntn    Bkln   9M      Hayes

R2 =

bname   lno   amt
Dntn    L-17  1000
Dntn    L-23  2000
Mianus  L-93  500

A: Lossy. R1 ⋈ R2 includes:

bname   bcity  assets  cname    lno   amt
…
Dntn    Bkln   9M      Jones    L-23  2000
Dntn    Bkln   9M      Johnson  L-17  1000
Dntn    Bkln   9M      Hayes    L-23  2000
…

(R1 ⋈ R2 has 7 tuples, whereas R has 4)

Lossless Joins

Is the Following Decomposition Lossless or Lossy?

R1 =

bname   assets  cname    lno
Dntn    9M      Jones    L-17
Dntn    9M      Johnson  L-23
Mianus  1.7M    Jones    L-93
Dntn    9M      Hayes    L-17

R2 =

lno   bcity  amt
L-17  Bkln   1000
L-23  Bkln   2000
L-93  Hnck   500

A: Lossless. R1 ⋈ R2 has 4 tuples

Lossless Joins

Lossless or Lossy?

R1 =

bname   lno   amt   cname
Dntn    L-17  1000  Jones
Dntn    L-23  2000  Johnson
Mianus  L-93  500   Jones
Dntn    L-17  1000  Hayes

R2 =

bname   bcity  assets
Dntn    Bkln   9M
Mianus  Hnck   1.7M

A: Lossless. R1 ⋈ R2 has 4 tuples

Q: When is decomposition lossless?
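The tuple counts in these examples can be reproduced with a small sketch. It is only an illustration: the helper names (`project`, `natural_join`, `is_lossless_on_instance`) are made up here, relations are modelled as lists of dicts, and passing on one instance does not prove that a decomposition is lossless for every legal instance.

```python
# Sketch: project the bank instance onto two schemas, re-join, compare with R.
COLS = ("bname", "bcity", "assets", "cname", "lno", "amt")
R = [dict(zip(COLS, row)) for row in [
    ("Dntn", "Bkln", "9M", "Jones", "L-17", 1000),
    ("Dntn", "Bkln", "9M", "Johnson", "L-23", 2000),
    ("Mianus", "Hnck", "1.7M", "Jones", "L-93", 500),
    ("Dntn", "Bkln", "9M", "Hayes", "L-17", 1000),
]]


def project(rel, attrs):
    """Projection with duplicate elimination."""
    out = []
    for t in rel:
        p = {a: t[a] for a in attrs}
        if p not in out:
            out.append(p)
    return out


def natural_join(r1, r2):
    """Natural join: combine tuples that agree on all shared attributes."""
    out = []
    for t1 in r1:
        for t2 in r2:
            shared = set(t1) & set(t2)
            if all(t1[a] == t2[a] for a in shared):
                merged = {**t1, **t2}
                if merged not in out:
                    out.append(merged)
    return out


def is_lossless_on_instance(rel, attrs1, attrs2):
    """Return (join reproduces rel exactly?, number of tuples in the join)."""
    joined = natural_join(project(rel, attrs1), project(rel, attrs2))
    same = len(joined) == len(rel) and all(t in rel for t in joined)
    return same, len(joined)


# Split on bname: lossy on this instance, the join has 7 tuples.
print(is_lossless_on_instance(
    R, ("bname", "bcity", "assets", "cname"), ("bname", "lno", "amt")))
# Split on lno: the join gives back exactly the 4 original tuples.
print(is_lossless_on_instance(
    R, ("bname", "assets", "cname", "lno"), ("lno", "bcity", "amt")))
```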

Ensuring Lossless Joins

A Decomposition of R, R = R1 ∪ R2, is Lossless iff

R1 ∩ R2 → R1 or R1 ∩ R2 → R2

(i.e.: Intersecting atts must form a super key for one of the resulting smaller relations)

Intuition: Original relation R has n tuples; R1 and R2 share attributes A

▪ In R1: A not a key ⇒ |R1| = n
▪ In R2: A a key ⇒ |R2| ≤ n, & relationship with R1 is n:1

∴ n tuples in result
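A minimal sketch of this test, assuming FDs are written as (left side, right side) pairs of attribute sets. The function names `closure` and `is_lossless` are illustrative, and `BANK_FDS` encodes the two FDs the slides give for the bank schema in the dependency preservation example (bname → bcity assets, lno → amt bname).

```python
def closure(attrs, fds):
    """Attribute closure: all attributes functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_lossless(r1, r2, fds):
    """True iff (R1 ∩ R2) → R1 or (R1 ∩ R2) → R2 follows from fds."""
    common_closure = closure(set(r1) & set(r2), fds)
    return set(r1) <= common_closure or set(r2) <= common_closure


BANK_FDS = [
    ({"bname"}, {"bcity", "assets"}),
    ({"lno"}, {"amt", "bname"}),
]

# The shared attribute cname determines neither side: lossy.
print(is_lossless({"bname", "bcity", "assets", "cname"},
                  {"cname", "lno", "amt"}, BANK_FDS))
# The shared attribute lno is a key of (lno, bcity, amt): lossless.
print(is_lossless({"bname", "assets", "cname", "lno"},
                  {"lno", "bcity", "amt"}, BANK_FDS))
```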

Decomposition Goal #2: Dependency Preservation

Goal: Efficient integrity checks of FD’s

An Example With No Dependency Preservation:

R = (bname, bcity, assets, cname, lno, amt)
bname → bcity assets
lno → amt bname

Decomposition: R = R1 ∪ R2

R1 = (bname, assets, cname, lno)
R2 = (lno, bcity, amt)

Lossless, but Not DP. Why?

Decomposition Goal #2: Dependency Preservation (cont.)

Decomposition (cont.): R = R1 ∪ R2

R1 = (bname, assets, cname, lno)
R2 = (lno, bcity, amt)

Lossless, but Not DP. Why?

A: bname → bcity crosses 2 tables

    CREATE ASSERTION bname_bcity CHECK
      (NOT EXISTS
        (SELECT *
         FROM R1 AS x1, R2 AS y1, R1 AS x2, R2 AS y2
         WHERE x1.lno = y1.lno
           AND x2.lno = y2.lno
           AND x1.bname = x2.bname
           AND y1.bcity <> y2.bcity))

Decomposition Goal #2: Dependency Preservation

To Ensure Best Possible Efficiency of FD Checks

Ensure that only a SINGLE table be examined for each FD, i.e.: ensure that A1, …, An → B1, …, Bm can be checked by examining one table, as in:

Ri = (…, A1, …, An, B1, …, Bm, …)

Above: Ri “covers” the FD A1, …, An → B1, …, Bm

To Test if Decomposition R = R1 ∪ … ∪ Rn is DP:

1. See which FD’s of R are covered by R1, …, Rn
2. Compare closure of (1) to closure of FD’s of R

Decomposition Goal #2: Dependency Preservation

More Formally: To test if R = R1 ∪ … ∪ Rn is dependency preserving wrt R’s FD set, F:

1. Compute F+
2. Compute G
     G ← ∅
     For i ← 1 to n DO
       Add to G those FD’s in F+ covered by Ri
3. Compute G+
4. If F+ = G+: Decomposition is DP
   If F+ ≠ G+: Decomposition is not DP

Decomposition Goal #2: Dependency Preservation (cont.)

More Formally (cont.): To test if R = R1 ∪ … ∪ Rn is dependency preserving wrt R’s FD set, F:

1. Compute F+
2. Compute G
3. Compute G+
4. Compute F+ - G+

Example:

F = {A → B, AB → D, C → D}
R1 = (A, B, C); R2 = (C, D)

Is this decomposition of (A, B, C, D) DP?

Aside: Computing F+

Many algorithms call for it. If you know Armstrong’s Axioms cold, you can generate it lazily:

1. Compute Fc
2. Use Armstrong’s Axioms to derive (X → Y) ∈ Fc+ as needed

Decomposition Goal #2: Dependency Preservation

Example:

F = {A → B, AB → D, C → D}
R1 = (A, B, C); R2 = (C, D)

Is R = R1 ∪ R2 DP?

A:
1. F+ = {A → B, AB → D, C → D}+       Note: (A → D) ∈ F+
2. G = ∅ ∪ {A → B, …} ∪ {C → D, …}     Note: (A → D) ∉ G
3. G+ = {…}                            Note: (A → D) ∉ G+
4. F+ ≠ G+ because (A → D) ∈ (F+ - G+)

∴ Decomposition is not DP
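The test can be sketched in Python without materializing all of F+. This is an illustration under stated assumptions rather than the course's reference code: G is built as X → (X+ ∩ Ri) for every nonempty X ⊆ Ri, which stands in for "the FD's in F+ covered by Ri", and since G+ can never contain an FD outside F+, comparing F+ with G+ reduces to checking that each FD of F follows from G. The function names are made up for this sketch, and the subset enumeration is exponential, so it only suits small schemas.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under the FDs in fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def covered_fds(ri, fds):
    """FDs of F+ covered by Ri, written as X -> (X+ ∩ Ri) for each X ⊆ Ri."""
    attrs = sorted(ri)
    g = []
    for k in range(1, len(attrs) + 1):
        for x in combinations(attrs, k):
            g.append((set(x), closure(x, fds) & set(attrs)))
    return g


def is_dependency_preserving(decomposition, fds):
    """True iff every FD in fds follows from the FDs covered by some Ri."""
    g = [fd for ri in decomposition for fd in covered_fds(ri, fds)]
    return all(set(rhs) <= closure(lhs, g) for lhs, rhs in fds)


F = [({"A"}, {"B"}), ({"A", "B"}, {"D"}), ({"C"}, {"D"})]

# The slide's decomposition: AB → D (and hence A → D) is lost.
print(is_dependency_preserving([{"A", "B", "C"}, {"C", "D"}], F))
```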

Decomposition Goal #2: Dependency Preservation

Example:

F = {A → B, AB → D, C → D}

What is a DP decomposition of F?

A: R = R1 ∪ R2 s.t. R1 = (A, B, D); R2 = (C, D)
1. F+ = {A → B, AB → D, C → D}+
2. G+ = {A → B, AB → D, C → D}+
3. F+ = G+    Note: G+ cannot introduce FD’s not in F+

∴ Decomposition is DP

Q: Does it satisfy lossless joins?

A: No

Decomposition Goals Summary

Lossless Joins
Motivation: Avoid information loss
Idea: No noise introduced when reconstituting universal relation via joins
Test: At each decomposition R = R1 ∪ R2, test: (R1 ∩ R2) → R1 or (R1 ∩ R2) → R2
Ensured for: BCNF, 3NF

Dependency Preservation
Motivation: Efficient FD assertions
Idea: No global integrity constraints (gic’s) require joins of more than 1 table with itself
Test: R = R1 ∪ … ∪ Rn is DP if closure of FD’s covered by each Ri = closure of FD’s covered by R = F+
Ensured for: 3NF

Decomposition Goal #3: Redundancy Avoidance

Redundancy:

A  B  C
a  x  1
e  x  1
g  y  2
h  y  2
m  y  2
n  z  1
p  z  1

1. Name FD of this relation? A: B → C
2. Name the super keys of this relation. A: All sets of atts that include A
3. When do we have redundancy? A: When ∃ some FD, X → Y, covered by relation & X not a super key

Decomposition Goals Summary (cont.)

Redundancy Avoidance
Motivation: Avoid update, deletion anomalies
Idea: Avoid update anomalies, wasted space
Test: For any nontrivial X → Y covered by Ri, X should be a superkey of Ri
Ensured for: BCNF

Boyce-Codd Normal Form

What is a Normal Form?
Characterization of schema decomposition in terms of properties it satisfies

BCNF:
Guarantees no redundancy and lossless joins (Not DP!)

Defined: Relation schema R, with FD set F, is in BCNF if:
For all nontrivial X → Y in F+: X → R (i.e.: X is a super key)

BCNF

Example: R = (A, B, C)
F = {A → B, B → C}

Is R in BCNF?

A: Consider the nontrivial dependencies in F+:
1. A → B, A → R (A is a key)
2. A → C, A → R (A is a key)
3. B → C, B ↛ A (B is not a key)

Therefore, R not in BCNF

BCNF

Example: R = R1 ∪ R2
R1 = (A, B); R2 = (B, C)
F = {A → B, B → C}

Are R1, R2 in BCNF?

A:
1. Test R1: A → B covered, A → R1 (all other FD’s covered trivial)
2. Test R2: B → C covered, B → R2 (all other FD’s covered trivial)

∴ R1, R2 in BCNF

BCNF Decomposition Algorithm

ALGORITHM BCNF (R: Relation, F: FD set)
BEGIN
  1. Compute F+
  2. Result ← {R}
  3. While some Ri ∈ Result not in BCNF, DO
     a. Choose (X → Y) ∈ F+ s.t.
        → (X → Y) covered by Ri
        → X ↛ Ri
     b. Decompose Ri on (X → Y)
        Ri1 ← X ∪ Y
        Ri2 ← Ri – Y
     c. Result ← (Result – {Ri}) ∪ {Ri1, Ri2}
  4. Return Result
END
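The algorithm above can be sketched in Python as follows. Several choices here are assumptions of this sketch, not part of the slides: F+ is never materialized (attribute closures are computed on demand), the violating FD is found by scanning subsets X of Ri in sorted order, and Y is taken to be everything in Ri that X determines, so a different choice of FD can yield a different valid decomposition. The names `find_violation` and `bcnf_decompose` are illustrative.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under the FDs in fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def find_violation(ri, fds):
    """Return (X, Y): nontrivial X -> Y covered by ri with X not a superkey of ri."""
    ri = frozenset(ri)
    for k in range(1, len(ri)):
        for combo in combinations(sorted(ri), k):
            x = frozenset(combo)
            x_plus = closure(x, fds)
            y = (x_plus & ri) - x
            if y and not ri <= x_plus:
                return x, frozenset(y)
    return None  # ri is in BCNF


def bcnf_decompose(r, fds):
    """Split schemas on violating FDs until every schema is in BCNF."""
    result = {frozenset(r)}
    while True:
        for ri in result:
            violation = find_violation(ri, fds)
            if violation:
                break
        else:
            return result  # no schema has a violation
        x, y = violation
        # Ri1 = X ∪ Y, Ri2 = Ri – Y
        result = (result - {ri}) | {x | y, ri - y}


F = [({"A"}, {"B"}), ({"A", "B"}, {"D"}), ({"B"}, {"C"})]
print(bcnf_decompose("ABCD", F))  # the schemas (B, C) and (A, B, D)
```

Each split produces two strictly smaller schemas, so the loop terminates; the subset scan is exponential in the size of a schema and is meant for classroom-sized examples only.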

BCNF Decomposition Algorithm

Each Step: Decompose Ri that is not in BCNF

Ri = (A, B, C, D, E)
(B → CD) ∈ F+, B ↛ Ri

Ri1 = (B, C, D)    Note: B → CD covered, and B → Ri1
Ri2 = (A, B, E)

Progress!

BCNF Decomposition Algorithm (cont.)

Example:

R = (A, B, C, D)
F = {A → B, AB → D, B → C}

Decompose R into BCNF?

1. Compute F+:
F+ = {A → B, AB → D, B → C, A → C, A → D, AB → C, AC → D, AD → C, ABC → D, ABD → C} + all trivial dep’s

BCNF Decomposition Algorithm (cont.)

Ri = (A, B, C, D)
B → C covered, B ↛ Ri

R1 = (B, C)    In BCNF
  1. B → C covered & B → R1

R2 = (A, B, D)    In BCNF
  2. A → B, 3. AB → D, 4. A → D covered & A → R2, AB → R2

∴ Solution is R = R1 ∪ R2

BCNF

R = (A, B, C, D, E, H)
F = {A → BC, E → HA}

Find 2 decompositions, 1 DP and 1 not DP

Decompose R into BCNF:
F+ = {A → B, A → C, A → BC, E → H, E → A, E → HA, E → B, E → C, E → BC, E → HB, E → HC, E → AB, E → AC, AE → …, ABE → …, ACE → …, ADE → …, …} + all trivial dep’s

Note: This will suffice!

BCNF Decomposition

R = (A, B, C, D, E, H)
F = {A → BC, E → HA}

Decomposition #1: R = R1 ∪ R3 ∪ R4

R = (A, B, C, D, E, H)    Decompose on A → BC
    R1 = (A, B, C)
    R2 = (A, D, E, H)     Decompose on E → HA
        R3 = (A, E, H)
        R4 = (D, E)

Q: Is this DP?
A: Yes. All Fc covered by R1, R3, R4. Therefore F+ covered

BCNF Decomposition (cont.)

R = (A, B, C, D, E, H)
F = {A → BC, E → HA}

(Note: Fc = F)

Decomposition #2: R = R1 ∪ R3 ∪ R5 ∪ R6

R = (A, B, C, D, E, H)    Decompose on A → B
    R1 = (A, B)
    R2 = (A, C, D, E, H)  Decompose on E → HA
        R3 = (A, E, H)
        R4 = (C, D, E)    Decompose on E → C
            R5 = (C, E)
            R6 = (E, D)

Q: Not DP. Why?
A: A → C not covered by R1, R3, R5, R6.

More BCNF (cont.)

R = (J, K, L)
F = {JK → L, L → K}

Q: Can we decompose on FD’s in Fc to get a DP BCNF decomposition?

Decomposition #1: R = (J, K, L), decompose on L → K
    R1 = (L, K)    R2 = (J, L)
    Not DP: JK → L not covered

Decomposition #2: R = (J, K, L), decompose on JK → L
    R1 = (J, K, L)    R2 = (J, K)
    Still not in BCNF (L not a superkey)

A: Sometimes, BCNF + DP not possible

Aside

Is This a Realistic Example?

JK → L
L → K

A:
BranchName CustomerName → BankerName    (a customer works with the same banker at a given branch)
BankerName → BranchName                 (every banker works at one branch)

Testing for FDs Across Relations

•  Decomposition not dependency preserving ⇒ an extra materialized view (MV) for each dependency α → β in Fc that is not preserved in the decomposition
•  The MV is a projection on αβ of the join of the relations in the decomposition
•  DBMS maintains MV when the relations are updated ⇒ no extra coding effort for programmer
   - Space overhead: storing MV
   - Time overhead: keeping MV up to date

Multivalued Dependencies

•  There are database schemas in BCNF that do not seem to be sufficiently normalized

•  Consider a database classes(course, teacher, book)

•  The database lists for each course the set of teachers any one of which can be the course’s instructor, and the set of books, all of which are required for the course (no matter who teaches it).

classes

course             teacher    book
database           Avi        DB Concepts
database           Avi        Ullman
database           Hank       DB Concepts
database           Hank       Ullman
database           Sudarshan  DB Concepts
database           Sudarshan  Ullman
operating systems  Avi        OS Concepts
operating systems  Avi        Shaw
operating systems  Jim        OS Concepts
operating systems  Jim        Shaw

(course, teacher, book) is the only key, and therefore the relation is in BCNF

Insertion anomalies: e.g., if Sara is a new teacher who can teach database, two tuples need to be inserted:

(database, Sara, DB Concepts)
(database, Sara, Ullman)

Therefore, it is better to decompose classes into:

teaches

course             teacher
database           Avi
database           Hank
database           Sudarshan
operating systems  Avi
operating systems  Jim

text

course             book
database           DB Concepts
database           Ullman
operating systems  OS Concepts
operating systems  Shaw

We shall see that these two relations are in Fourth Normal Form (4NF)

Multivalued Dependencies (MVDs)

Let R be a relation schema and let α ⊆ R and β ⊆ R. The multivalued dependency

    α →→ β

holds on R if in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α] = t2[α], there exist tuples t3 and t4 in r such that:

    t1[α] = t2[α] = t3[α] = t4[α]
    t3[β] = t1[β]
    t3[R – β] = t2[R – β]
    t4[β] = t2[β]
    t4[R – β] = t1[R – β]
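The definition can be checked on a single instance with a short sketch. The function name `satisfies_mvd` and the tuple encoding are assumptions made for illustration; the data is the classes relation from the earlier slide. Because the loop visits every ordered pair (t1, t2), checking for t3 alone also covers t4.

```python
SCHEMA = ("course", "teacher", "book")
CLASSES = [
    ("database", "Avi", "DB Concepts"),
    ("database", "Avi", "Ullman"),
    ("database", "Hank", "DB Concepts"),
    ("database", "Hank", "Ullman"),
    ("database", "Sudarshan", "DB Concepts"),
    ("database", "Sudarshan", "Ullman"),
    ("operating systems", "Avi", "OS Concepts"),
    ("operating systems", "Avi", "Shaw"),
    ("operating systems", "Jim", "OS Concepts"),
    ("operating systems", "Jim", "Shaw"),
]


def satisfies_mvd(rows, schema, alpha, beta):
    """Check alpha ->> beta on one instance (rows are tuples ordered as schema)."""
    a_idx = [i for i, a in enumerate(schema) if a in alpha]
    b_idx = {i for i, a in enumerate(schema) if a in beta}
    present = set(rows)
    for t1 in rows:
        for t2 in rows:
            if all(t1[i] == t2[i] for i in a_idx):
                # t3 takes its beta values from t1 and everything else from t2.
                t3 = tuple(t1[i] if i in b_idx else t2[i]
                           for i in range(len(schema)))
                if t3 not in present:
                    return False
    return True


print(satisfies_mvd(CLASSES, SCHEMA, {"course"}, {"teacher"}))  # holds
print(satisfies_mvd(CLASSES, SCHEMA, {"teacher"}, {"book"}))    # does not hold
```

An instance-level check like this can only refute an MVD; whether the MVD holds on the schema is a statement about all legal instances.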

MVD (Cont.) Tabular representation of α →→ β

Example

•  Let R be a relation schema with a set of attributes that are partitioned into 3 nonempty subsets: Y, Z, W
•  We say that Y →→ Z (Y multidetermines Z) if and only if for all possible relations r(R):
     < y1, z1, w1 > ∈ r and < y1, z2, w2 > ∈ r
   implies
     < y1, z1, w2 > ∈ r and < y1, z2, w1 > ∈ r
•  Note that since the behavior of Z and W are identical, it follows that Y →→ Z if Y →→ W

Example (Cont.)

•  In our example:
     course →→ teacher
     course →→ book
•  The above formalizes the notion that a particular value of Y (course) has associated with it a set of values of Z (teacher) and a set of values of W (book), and these two sets are in some sense independent of each other.

Note: If Y → Z then Y →→ Z. Indeed, we have (in the above notation) z1 = z2. The claim follows.

Use of Multivalued Dependencies

•  We use multivalued dependencies in two ways:
   1. To test relations to determine whether they are legal under a given set of functional and multivalued dependencies
   2. To specify constraints on the set of legal relations. We shall thus concern ourselves only with relations that satisfy a given set of functional and multivalued dependencies.
•  If a relation r fails to satisfy a given multivalued dependency, we can construct a relation rʹ that does satisfy the multivalued dependency by adding tuples to r.
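The last point, building rʹ by adding tuples, can be sketched as a fixpoint loop. This is a hypothetical helper (`complete_for_mvd` is not a name from the course): it keeps adding the tuple t3 required by the MVD definition until nothing is missing. The sample data is the database part of the classes relation plus one tuple for the new teacher Sara.

```python
SCHEMA = ("course", "teacher", "book")
ROWS = [
    ("database", "Avi", "DB Concepts"),
    ("database", "Avi", "Ullman"),
    ("database", "Hank", "DB Concepts"),
    ("database", "Hank", "Ullman"),
    ("database", "Sara", "DB Concepts"),  # new teacher, only one book so far
]


def complete_for_mvd(rows, schema, alpha, beta):
    """Add tuples to rows until alpha ->> beta holds on the result."""
    a_idx = [i for i, a in enumerate(schema) if a in alpha]
    b_idx = {i for i, a in enumerate(schema) if a in beta}
    result = set(rows)
    changed = True
    while changed:
        changed = False
        for t1 in list(result):
            for t2 in list(result):
                if all(t1[i] == t2[i] for i in a_idx):
                    # Tuple demanded by the definition: beta from t1, rest from t2.
                    t3 = tuple(t1[i] if i in b_idx else t2[i]
                               for i in range(len(schema)))
                    if t3 not in result:
                        result.add(t3)
                        changed = True
    return result


added = complete_for_mvd(ROWS, SCHEMA, {"course"}, {"teacher"}) - set(ROWS)
print(added)  # the second Sara tuple from the insertion-anomaly slide
```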

Fourth Normal Form

•  A relation schema R is in 4NF with respect to a set D of functional and multivalued dependencies if for all multivalued dependencies in D+ of the form α →→ β, where α ⊆ R and β ⊆ R, at least one of the following holds:
   - α →→ β is trivial (i.e., β ⊆ α or α ∪ β = R)
   - α is a superkey for schema R
•  If a relation is in 4NF it is in BCNF