db systems - data modeling

126
1 Data modeling Semester 1 – 2014-2015 Dr. Vo Thi Ngoc Chau ([email protected]) Faculty of Computer Science & Engineering Ho Chi Minh City University of Technology

Upload: hoangtucuagio

Post on 14-Nov-2015

77 views

Category:

Documents


5 download

DESCRIPTION

DB Systems - Data Modeling

TRANSCRIPT

  • 1

    Data modeling

    Semester 1 2014-2015

    Dr. Vo Thi Ngoc Chau

    ([email protected])

    Faculty of Computer Science & Engineering

    Ho Chi Minh City University of Technology

  • 2

    Main references

    [1] R. Elmasri, S. R. Navathe, Fundamentals of Database Systems- 4th Edition, Pearson- Addison Wesley, 2003.

    [2] H. G. Molina, J. D. Ullman, J. Widom, Database System Implementation, Prentice-Hall, 2000.

    [3] H. G. Molina, J. D. Ullman, J. Widom, Database Systems: The Complete Book, Prentice-Hall, 2002

    [4] A. Silberschatz, H. F. Korth, S. Sudarshan, Database System Concepts 3rd Edition, McGraw-Hill, 1999.

    [5] R. Ramakrishnan, J. Gehrke, Database Management Systems- 2rd Edition, McGraw-Hill.

    [6] M. P. Papazoglou, S. Spaccapietra, Z. Tari, Advances in Object-Oriented Data Modeling, MIT Press, 2000.

    [7]. G. Simsion, Data Modeling: Theory and Practice, Technics Publications, LLC, 2007.

  • 3

    Content

    2.1. Concepts

    2.2. Why is data modeling important?

    2.3. Data modeling with entity-

    relationship and relational models

    2.4. Dependencies and normalization

    2.5. Conclusion

  • 2.1. Concepts

    Data modeling

    Conceptual data model

    Entity relationship model

    Representational data model

    Relational data model

    Database design

    Relational database design 4

  • Data modeling

    Modeling - Cambridge dictionary

    Model = a representation of something, either

    as a physical object which is usually smaller

    than the real object, or as a simple description

    of the object which might be used in calculations

    Modeling = constructing a representation of

    something, either as a physical object which is

    usually smaller than the real object, or as a

    simple description of the object which might be

    used in calculations 5

  • Data modeling

    Data modeling [7], pp. 31-32

    formalizing and representing the data structures of reality

    Shoval and Frumermann, 1994, p28

    a representation of the things of significance to an enterprise and the relationships among those things

    Hay, 1996a

    an attempt to capture the essence of things both concrete and abstract

    Keuffel, 1996

    6

  • Data modeling

    Data modeling [7], pp. 31-32

    an abstract representation of the data about entities, events, activities, and their associations within an organization

    McFadden, Hoffer et al., 1999

    The core idea underlying all the definitions is the same: a data model is used for describing entities and their relationships within a core domain.

    Topi and Ramesh, 2002

    7

  • Data modeling

    Data modeling [7], pp. 31-32

    Data modeling is generally viewed as a design activity

    Srinivasan and Teeni, 1990

    an activity that involves the creation of abstractions

    Davydov, 1994

    the art and science of arranging the structure and relationship of data

    McComb, 2004, p293

    data modeling is a design discipline

    Simsion and Witt, 2005, p7 8

  • Conceptual data model

    A data model (aka semantic data model)

    Provides concepts close to the way many users perceive data

    Provides the concepts essential for supporting the application environment at a very high non-system-specific level

    Used for a conceptual schema of a database from data requirements

    Example: entity-relationship model

    9

  • Conceptual data model

    Characteristics of a conceptual data model

    Expressiveness

    Distinctions between data, relationships, constraints

    Simplicity

    Simple enough for an end user to use and understand an easy diagrammatic notation

    Minimality

    A small number of basic concepts that are distinct and orthogonal in their meaning

    Formality

    Concepts must be formally defined state criteria for the validity of a schema in the model

    Unique interpretation

    Complete and unambiguous semantics for each modeling construct 10

    S. Navathe. Evolution of data modeling for databases. Communications of the ACM 35(9)(1992) 112-123.

  • Representational data model

    A data model (aka implementation/logical data model)

    Provides concepts understood by end-users but not too far removed from the way data is organized within the computer

    Hides some details of data storage but able to be implemented on a computer system in a direct way (in some DBMS)

    Example: relational data model

    11

  • Database design

    Design the logical and physical structure of one or

    more databases to accommodate the information

    needs of the users in an organization for a defined

    set of applications

    The overall database design activity has to undergo

    a systematic process called the design methodology,

    whether the target database is managed by an

    RDBMS, ODBMS, or ORDBMS, ...

    The result of the design activity is a rigidly defined

    database schema that cannot easily be modified

    once the database is implemented. 12

  • Database design

    Goals:

    Satisfy the information content requirements of

    the specified users and applications

    Provide a natural and easy-to-understand

    structuring of the information

    Support processing requirements and any

    performance objectives, such as response time,

    processing time, and storage space 13

  • Database design

    14

    Figure 2.1 Phases of database

    design and implementation for

    large databases

    [1], pp. 368

  • Database design

    Six main phases of the overall database

    design and implementation process

    Requirements collection and analysis

    Conceptual database design

    Choice of a DBMS

    Data model mapping (also called logical

    database design)

    Physical database design

    Database system implementation and tuning 15

  • Database design Phase 1: Requirements Collection and Analysis

    Identify the major application areas and user groups that will use the database or whole work will be affected by the database

    Study and analyze existing documentation concerning the applications

    Study the current operating environment and planned use of the information

    Analyze the types of transactions and their frequencies as well as of the flow of information within the system

    Study geographic characteristics regarding users, origin of transactions, destination of reports,

    Specify the input and output data for the transactions

    Collect written responses to sets of questions from the potential database users or user groups

    Users priorities and the importance users place on various applications

    Requirements from users and applications of the information system that will interact with the database system

    Time-consuming but crucial to the success of the information system

    Know and analyze the expectations of the users and the intended uses of the database in as much detail as possible 16

  • Database design

    Phase 2: Conceptual Database Design

    Phase 2a: Conceptual Schema Design

    Examine the data requirements resulting from Phase 1

    Produce a conceptual database schema in a DBMS-independent high-level data model

    Phase 2b: Transaction Design

    Design the characteristics of known database transactions (applications) in a DBMS-independent way to ensure that the database schema will include all the information required by these transactions

    Identify transactions input/output and functional behavior

    Group transactions into three categories: retrieval transactions, update transactions, mixed transactions 17

  • Database design

    Phase 2: Conceptual Database Design

    Phase 2a: Conceptual Schema Design

    Goal = a complete understanding of the database structure, meaning (semantics), interrelationships, and constraints

    Identify: entity types, relationship types, attributes, key attributes, cardinality and participation constraints on relationships, weak entity types, specialization/generalization hierarchies

    Approaches

    The centralized (or one-shot) schema design approach

    The view integration approach 18

  • Database design

    Phase 2a: Conceptual Schema Design

    The centralized (or one-shot) schema design approach

    The requirements of the different applications and user groups from Phase 1 are merged into a single set of requirements before schema design begins.

    A single schema corresponding to the merged set of requirements is then designed.

    The DBA is responsible for deciding how to merge the requirements and for designing the conceptual schema for the whole database

    Once the conceptual schema is designed and finalized, external schemas for the various user groups and applications can be specified by the DBA.

    19

  • Database design

    Phase 2a: Conceptual Schema Design

    The view integration approach

    The requirements are not merged.

    A schema (or view) is designed for each user group or application based only on its own requirements.

    Each user group or application

    These schemas are merged or integrated into a global conceptual schema for the entire database.

    The DBA

    The individual views can be reconstructed as external schemas after view integration.

    20

  • Database design

    Phase 3: Choice of a DBMS

    Technical factors

    Suitability & type of the DBMS for the task

    Storage structures and access paths, the user and programmer interfaces, high-level query languages, development tools, architectural options, supported by DBMS

    DBMS portability among different types of hardware

    Non-technical factors

    Cost: software acquisition cost, maintenance cost, hardware acquisition cost, database creation and conversion cost, personnel cost, training cost, operating cost

    Availability of vendor services

    21

  • Database design

    Phase 4: Data Model Mapping (Logical

    Database Design)

    Create a conceptual schema and external

    schemas in the data model of the selected DBMS

    The result of this phase should be DDL

    statements in the language of the chosen DBMS

    that specify the conceptual and external level

    schemas of the database system. 22

  • Database design

    Phase 5: Physical Database Design

    Restricted to choosing the most appropriate structures for the database files from among the options offered by that DBMS

    Choose specific storage structures and access paths for the database files

    Response time

    Space utilization

    Transaction throughput

    Estimate record size and number of records in each database file

    Estimate the update and retrieval patterns for the file cumulatively from all the transactions

    Estimate the file growth, either in the record size because of new attributes or in the number of records 23

  • Database design

    Phase 6: Database System Implementation and Tuning

    Create the database schemas and (empty) database files

    Responsibility of the DBA in conjunction with the database designers

    Reformat the data for loading into the new database if needed

    Load/populate with the data if needed

    Implement database transactions referring to the conceptual specifications of transaction, then write and test program code with embedded DML commands

    Responsibility of the application programmers

    Database tuning continues as long as the database is in existence, as long as performance problems are discovered, and while the requirements keep changing.

    24

  • Relational database design

    Two main approaches

    A top-down design technique

    Designing a conceptual schema in a high-level data model

    Mapping the conceptual schema into a relational database schema

    Analyzing each of the resulting relations based on the functional dependencies and assigned primary keys

    A bottom-up design technique (relational synthesis)

    View relational database schema design strictly in terms of functional and other types of dependencies specified on the database attributes

    Synthesize the relation schemas applying a normalization algorithm

    25

  • 2.2. Why is data modeling important?

    Leverage

    Make programming simpler and cheaper

    Poor data organization can be expensive to fix.

    Conciseness

    Take more directly to the heart of the business requirements

    Data quality

    Problems with data quality can be traced back to a lack of consistency in:

    Defining and interpreting data

    Implementing mechanisms to enforce the definitions

    A key role in achieving good data quality by establishing a common understanding of what is to be held in each table and column and how it is to be interpreted

    26 G. C. Simsion, G. C. Witt, Data modeling essentials 3rd edition, Elsevier Inc, 2005, pp. 8-10.

  • 2.3. Data modeling with entity-relationship and relational models

    Entity-relationship model

    Relational data model

    Data model mapping

    27

  • Database design

    28

    Figure 2.1 Phases of database

    design and implementation for

    large databases

    [1], pp. 368

    Entity Relationship Model

    P. P-S. Chen. The Entity-

    Relationship Model Toward

    a Unified View of Data. ACM

    Transactions on Database

    Systems 1(1)(March 1976)

    9-36.

  • Entity Relationship Model P. P-S. Chen. The Entity-Relationship Model Toward a Unified View of Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.

    The entity-relationship model adopts the more natural view that the real world consists of entities and relationships.

    The model can achieve a high degree of data independence and is based on set theory and relation theory.

    The entity-relationship model can be used as a basis for a unified view of data.

    A special diagrammatric technique, the entity-relationship diagram, is introduced as a tool for database design.

    29

  • Entity Relationship Model P. P-S. Chen. The Entity-Relationship Model Toward a Unified View of Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.

    30 Fig. 1. Analysis of data models using multiple levels of logical views

  • Entity Relationship Model P. P-S. Chen. The Entity-Relationship Model Toward a Unified View of Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.

    The entity-relationship (ER) modeling

    concepts

    Entity types

    Relationship types

    Attributes

    Keys

    Structural constraints

    31

  • Entity Relationship Model P. P-S. Chen. The Entity-Relationship Model Toward a Unified View of Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.

    32

    Symbol Meaning Example

    Entity type Employees

    Weak entity type Dependents of each employee

    Relationship type Employee

    works_on Project

    Identifying relationship of the weak entity type

    Dependents dependents_of

    Employees

    Employee

    Dependent

    works_on

    dependents_of

  • Entity Relationship Model P. P-S. Chen. The Entity-Relationship Model Toward a Unified View of Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.

    33

    Symbol Meaning Example

    Attribute Name of an employee

    Key Attribute Distinct identifier of

    an employee

    Multivalued Attribute

    Phone numbers of an employee

    Composite Attribute

    Address (Street, District, City) of an

    employee

    Derived Attribute Age of an employee

    (derived from date of birth)

    Name

    EmployeeID

    PhoneNumber

    Address

    Age

    Street District City

  • Entity Relationship Model P. P-S. Chen. The Entity-Relationship Model Toward a Unified View of Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.

    34

    Constraints

  • 35

    Figure 3.2. An ER schema diagram for the COMPANY database [1], pp. 54.

  • 36

    Figure 3.18. An ER diagram for a BANK database schema [1], pp. 81.

    a. List the entity types

    b. Is there a weak entity type? If so, give its name, partial key, and identifying relationship

    c. What constraints do the partial key and the identifying relationship of the weak entity type specify in this diagram?

    d. List the names of all relationship types, and specify the (min, max) constraint on each participation of an entity type in a relationship type.

  • Enhanced Entity Relationship Modeling

    Class/subclass relationships and type inheritance

    The relationship between a superclass and any one of its subclasses

    Specialization

    Define a set of subclasses of an entity type

    Establish additional specific attributes with each subclass

    Establish additional specific relationship types between each subclass and other entity types or other subclasses

    Generalization

    A reverse process of abstraction in which we suppress the differences among several entity types

    Identify their common features

    Generalize them into a single superclass of which the original entity types are special subclasses

    37

  • 38 Figure 4.1. EER diagram notation to represent subclasses and specialization. [1], pp. 87.

    total specialization partial specialization

  • 39

    Figure 4.3. Generalization [1], pp. 91.

  • 40 Figure 4.5. EER diagram notation for overlapping (nondisjoint) specialization. [1], pp. 93.

    - Disjoint (d): the subclasses of the specialization must be disjoint.

    - Overlapping (o): the subclasses are not constrained to be disjoint, their sets of entities may overlap.

  • 41

    Figure 4.7. A specialization lattice with multiple inheritance for a UNIVERSITY database

    [1], pp. 96.

    a. List superclasses, subclasses

    of each superclass

    b. List class/subclass

    relationships

    c. Describe constraints on each

    specialization

  • Database design

    42

    Figure 2.1 Phases of database

    design and implementation for

    large databases

    [1], pp. 368

    A representational data

    model supported by the

    chosen DBMS

    Relational Data Model

    E. F. Codd. A Relational

    Model of Data for Large

    Shared Data Banks.

    Communications of the ACM

    13(6)(June, 1970) 377-387.

  • Data model E. F. Codd. Data models in database management, ACM, 1980.

    A combination of three following components

    (1). A collection of data structure types (the building

    blocks of any database that conforms to the model);

    (2). A collection of operators or inferencing rules, which

    can be applied to any valid instances of the data types

    listed in (1), to retrieve or derive data from any parts

    of those structures in any combinations desired;

    (3). A collection of general integrity rules, which

    implicitly or explicitly define the set of consistent

    database states or changes of state or both -- these

    rules may sometimes be expressed as insert-update-

    delete rules 43

  • Relational Data Model E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.

    The term relation is used here in its accepted mathematical sense.

    A relational database based on the relational data model is a collection of relations.

    Given sets S1, S2, , Sn (not necessarily distinct), R is a relation on these n sets if it is a set of n-tuples each of which has its first element from S1, its second element from S2, and so on.

    Sj is the jth domain of R.

    R is said to have degree n.

    More concisely, R is a subset of the Cartesian product S1xS2xxSn. 44

  • Relational Data Model E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.

    45

    Fig. 1. A relation of degree 4

    A relation of degree 4, called supply, which reflects the shipments-in-progress

    of parts from specified suppliers to specified projects in specified quantities.

  • Relational Data Model E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.

    An n-ary relation R has the following properties:

    Each row represents an n-tuple of R.

    The ordering of rows is immaterial.

    All rows are distinct.

    The ordering of columns is significant it corresponds to the ordering S1, S2, , Sn of the domains on which R is defined.

    The significance of each column is partially conveyed by labeling it with the name of the corresponding domain.

    46

  • Data model E. F. Codd. Data models in database management, ACM, 1980.

    A combination of three following components

    (1). A collection of data structure types (the building

    blocks of any database that conforms to the model);

    (2). A collection of operators or inferencing rules, which

    can be applied to any valid instances of the data types

    listed in (1), to retrieve or derive data from any parts

    of those structures in any combinations desired;

    (3). A collection of general integrity rules, which

    implicitly or explicitly define the set of consistent

    database states or changes of state or both -- these

    rules may sometimes be expressed as insert-update-

    delete rules 47

  • Relational Data Model E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.

    Operations on relations

    Usual set operations: union, intersection, minus

    Compatible relations

    Projection ()

    Select certain columns of a relation (striking out the others)

    Selection ()

    Select a subset of the tuples from a relation that satisfy a selection condition

    Cartesian product (cross product) (x)

    Combine tuples from two relations in a combinatorial fashion

    Join (inner/outer join, equijoin, natural join)

    Combine related tuples from two relations into single tuples

  • 49

  • Relational Data Model E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.

    50

  • Relational Data Model E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.

    51

    T R : S

  • Data model E. F. Codd. Data models in database management, ACM, 1980.

    A combination of three following components

    (1). A collection of data structure types (the building

    blocks of any database that conforms to the model);

    (2). A collection of operators or inferencing rules, which

    can be applied to any valid instances of the data types

    listed in (1), to retrieve or derive data from any parts

    of those structures in any combinations desired;

    (3). A collection of general integrity rules, which

    implicitly or explicitly define the set of consistent

    database states or changes of state or both -- these

    rules may sometimes be expressed as insert-update-

    delete rules 52

  • Relational Data Model E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.

    Constraints inherent in the data model

    Domain constraints

    Key constraints

    Constraints on nulls

    Entity integrity constraints

    Referential integrity constraints

    53

  • Relational Data Model E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.

    Domain constraints

    Within each tuple, the value of each attribute A must be an atomic value from the domain dom(A).

    Key constraints

    1. Two distinct tuples in any state of the relation cannot have identical values for (all) the attributes in the key.

    2. It is a minimal superkey that is, a superkey from which we cannot remove any attributes and still have the uniqueness constraint in condition 1 hold.

    Superkey of the relation schema R specifies a uniqueness constraint that no two distinct tuples in any state r of R can have the same value for the superkey. 54

  • Relational Data Model E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.

    Key constraints

    A relation schema may have more than one key.

    Candidate keys: primary key and secondary keys

    Constraints on nulls

    Specify whether null values are or are not permitted on attributes

    Entity integrity constraint

    No primary key value can be null.

    Referential integrity constraint

    Specify a referential integrity constraint between the two relation schemas R1 and R2 55

  • Relational Data Model E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.

    Referential integrity constraint

    A set of attributes FK (foreign key) in relation

    schema R1 is a foreign key of R1 that references

    relation R2 if it satisfies the following two rules:

    The attributes in FK have the same domain(s) as the

    primary key attributes PK of R2; the attributes FK are

    said to reference or refer to the relation R2.

    A value of FK in a tuple t1 of the current state r1(R1)

    either occurs as a value of PK for some tuple t2 in the

    current state r2(R2) or is null. In the former case, we

    have t1[FK] = t2[PK], and we say that the tuple t1

    references or refers to the tuple t2. 56

  • Relational Data Model E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6)(June, 1970) 377-387.

    57

    Relation Employee

    Relation schema: EMPLOYEE (FNAME, MINIT, LNAME, SSN,

    BDATE, ADDRESS, SEX, SALARY, SUPERSSN, DNO)

    Attributes = FNAME, MINIT, LNAME, SSN,

    Domain of Attribute SEX = {F, M}

    Domain of SALARY = a set of positive integer numbers

    Primary key = SSN

    Foreign key = SUPERSSN

  • 58

    The SQL standard has gone through a number of revisions.

    http://en.wikipedia.org/wiki/SQL (Accessed on 09/05/2012)

    High-level database language on relational databases: SQL-86, SQL-89, SQL-92, SQL-99,

  • SQL (structured query language)

    Data Definition Language (DDL): Create,

    Alter, Drop

    Data Manipulation Language (DML): Select,

    Insert, Update, Delete

    Data Control Language (DCL): Commit,

    Rollback, Grant, Revoke

    59

  • SQL (structured query language)

    60

    CREATE TABLE TableName

    {(colName dataType [NOT NULL] [UNIQUE]

    [DEFAULT defaultOption]

    [CHECK searchCondition] [,...]}

    [PRIMARY KEY (listOfColumns),]

    {[UNIQUE (listOfColumns),] [,]}

    {[FOREIGN KEY (listOfFKColumns)

    REFERENCES ParentTableName [(listOfCKColumns)],

    [ON UPDATE referentialAction]

    [ON DELETE referentialAction ]] [,]}

    {[CHECK (searchCondition)] [,] })

  • SQL (structured query language)

    61

    SELECT [DISTINCT | ALL]

    {* | [columnExpression [AS newName]] [,...] }

    FROM TableName [alias] [, ...]

    [WHERE condition]

    [GROUP BY columnList]

    [HAVING condition]

    [ORDER BY columnList]

  • Database design

    62

    Figure 2.1 Phases of database

    design and implementation for

    large databases

    [1], pp. 368

    Data model mapping

    from a conceptual

    schema based on the ER

    model to a relational

    database schema based

    on the relational data

    model

  • Data model mapping

    63

    Table 7.1. Correspondence between ER and Relational Models

    [1], pp. 198.

  • Data model mapping

    Mapping of specialization or generalization

    Convert each specialization with m subclasses {S1, S2, , Sm} and (generalized) superclass C, where the attributes of C are {k, a1, , an} and k is the (primary) key, into relation schemas using one of the four following options:

    Multiple relations Superclass and subclasses

    Multiple relations Subclass relations only (total)

    Single relation with one type attribute (disjoint)

    Single relation with multiple type attributes (overlapping) 64

  • 65

    Figure 4.3. Generalization [1], pp. 91.

    Using Multiple relations Subclass relations only (total) Figure 7.4, pp. 200.

  • 66

    Figure 7.4. Using Single relation with multiple type attributes with Boolean fields MFlag and PFlag. [1], pp. 200.

  • Data model mapping

    Mapping algorithms

    Automatically create a relational schema from a conceptual schema design

    Notes on data model mapping

    Slightly different for other post-relational data models as compared to the relational data model

    Constraints on databases

    Inherent model-based constraints

    Schema-based constraints

    Application-based constraints 67

  • 68

  • 69

  • 70

  • 71

  • Query

    72

    Example: For every project located in Stafford, retrieve the project number, the controlling department number and the department managers last name, address and birthdate.

    SQL query:

    Relational algebra expression:

    SELECT P.NUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE

    FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E

    WHERE P.DNUM=D.DNUMBER AND D.MGRSSN=E.SSN AND

    P.PLOCATION=STAFFORD;

    PNUMBER, DNUM, LNAME, ADDRESS, BDATE(((PLOCATION=STAFFORD(PROJECT))

    DNUM=DNUMBER (DEPARTMENT)) MGRSSN=SSN (EMPLOYEE))

  • Query: For each project on which more than two employees work, retrieve the project number, project name, and the number of employees who work on that project.

    73

    SELECT PNUMBER, PNAME, COUNT (*) AS PROJ_NUMBER

    FROM PROJECT, WORKS_ON

    WHERE PNUMBER=PNO

    GROUP BY PNUMBER, PNAME

    HAVING COUNT (*) > 2

    Temp (PNUMBER, PNAME, PROJ_NUMBER) PNUMBER, PNAME COUNT *

    (PNUMBER=PNO(PROJECT X WORKS_ON)) Result PROJ_NUMBER > 2 (Temp)

  • Query: retrieve the last name and first name of each employee

    whose salary is greater than the max salary in department 5.

    74

    DNO = 5

    EMPLOYEE

    MAX SALARY

    SALARY > C

    EMPLOYEE

    LNAME, FNAME

  • Data model mapping

    Informal measures of quality for relation schema design

    Semantics of the attributes

    How to interpret the attribute values stored in a tuple of the relation how the attribute values in a tuple relate to one another

    Reducing the redundant values in tuples

    Storage space & update anomalies

    Reducing the null values in tuples

    Multiple interpretations of nulls

    Disallowing the possibility of generating spurious tuples

    Join relations with equality conditions on attributes that are either primary keys or foreign keys in a way that guarantees that no spurious tuples are generated

    75

  • 2.4. Dependencies and normalization

    Functional dependency (FD)

    The single most important concept in relational schema design theory

    A constraint between two sets of attributes from the database

    A property of the semantics or meaning of the attributes.

    Examples

    SSN ENAME

    The value of an employees social security number (SSN) uniquely determines the employee name (ENAME)

    PNUMBER {PNAME, PLOCATION}

    {SSN, PNUMBER} HOURS 76

  • 2.4. Dependencies and normalization

    Functional dependency (FD)

    A constraint between two sets of attributes from the database

    A constraint on the possible tuples that can form a relation state r of a relation schema R = {A1, A2, , An}

    Denoted by XY

    X, Y: two sets of attributes that are subsets of R

    For any two tuples t1 and t2 in r that have t1[X] = t2[X], they must also have t1[Y] = t2[Y].

    The values of Y of a tuple in r depend on (are determined by) the values of X.

    The values of X of a tuple uniquely (functionally) determine the values of Y.

    If XY in R, this does not say whether or not YX in R. 77

  • 2.4. Dependencies and normalization

    Armstrongs inference rules (IR1-3) and others (IR4-6)

    78

  • 2.4. Dependencies and normalization

    Closure F+ of a set of functional dependencies

    F specified on relation schema R

    F+, the closure of F, is formally the set of all

    dependencies that include F as well as all

    dependencies that can be inferred from F.

    79

  • 2.4. Dependencies and normalization

    Closure F+ of a set of functional dependencies F

    F = {SSN{ENAME, BDATE, ADDRESS, DNUMBER},

    DNUMBER{DNAME, DMGRSSN}}

    F+ = {SSN{ENAME, BDATE, ADDRESS, DNUMBER},

    DNUMBER{DNAME, DMGRSSN}, SSNSSN, ,

    DNUMBER DNAME, SSN{DNAME, DMGRSSN}} 80

  • 2.4. Dependencies and normalization

    Closure F+ of a set of functional dependencies F

    F = {{SSN, PNUMBER}HOURS, SSNENAME,

    PNUMBER{PNAME, PLOCATION}}

    F+ = ??? 81

  • 2.4. Dependencies and normalization

    Closure X+ of a set of attributes X under F

    where X is a set of attributes that appears as a

    left-hand side of a functional dependency in F

    X+, the closure of X under F, is the set

    of attributes that are functionally

    determined by X based on F.

    82

  • 2.4. Dependencies and normalization

    Closure X+ of a set of attributes X under F

    where X is a set of attributes that appears as a

    left-hand side of a functional dependency in F

    83

  • 2.4. Dependencies and normalization

    Closure X+ of a set of attributes X under F

    where X is a set of attributes that appears as a

    left-hand side of a functional dependency in F

    F = {{SSN, PNUMBER}HOURS, SSNENAME,

    PNUMBER{PNAME, PLOCATION}}

    {SSN, PNUMBER}+ = {SSN, PNUMBER, HOURS, ENAME,

    PNAME, PLOCATION}

    {SSN}+ = {SSN, ENAME}

    {PNUMBER}+ = {PNUMBER, PNAME, PLOCATION} 84

  • 2.4. Dependencies and normalization

    Closure X+ of a set of attributes X under F

    where X is a set of attributes that appears as a

    left-hand side of a functional dependency in F

    F = {{A, B}{C}, {A}{D, E}, {B}{M}, {M}{G, H},

    {D}{I, J}}

    {A, B}+ = ???

    {A}+ = ???

    {B}+ = ???

    {M}+ = ???, and {D}+ = ??? 85

  • 2.4. Dependencies and normalization

    Identifying a key K of R based on a set of given functional dependencies F

    86

  • 2.4. Dependencies and normalization

    Identifying a key K of R based on a set of

    given functional dependencies F

    R = (A, B, C, D, E, G)

    F = {CE, AEG, EAD}

    K = ???

    87

  • 2.4. Dependencies and normalization

    Identifying a key K of R based on a set of given functional dependencies F

    R = (A, B, C, D, E, G)

    F = {CE, AEG, EAD}

    K = {A, B, C, D, E, G}

    (K-{A})+ = {B, C, D, E, G, A} K = {B, C, D, E, G}

    (K-{B})+ = {C, D, E, G, A} K = {B, C, D, E, G}

    (K-{C})+ = {B, D, E, G, A} K = {B, C, D, E, G}

    (K-{D})+ = {B, C, E, G, A, D} K = {B, C, E, G}

    (K-{E})+ = {B, C, G, E, A, D} K = {B, C, G}

    (K-{G})+ = {B, C, E, A, D, G} K = {B, C} 88

  • 2.4. Dependencies and normalization

    Identifying a key K of R based on a set of

    given functional dependencies F

    R = (A, B, C, E, G)

    F = {AB, BCE, EGA}

    K = ???

    89

  • 2.4. Dependencies and normalization

    Identifying all keys K of R based on a set of given functional dependencies F

    Find N = U - fFright(f)

    N is a set of the attributes not in the right side of any functional dependency in F. U is a set of all the attributes of R.

    Find N+, the closure of N under F

    If N+ = U then there is only one key K = N.

    Otherwise, find D = fFright(f) - fFleft(f)

    D is a set of the attributes only in the right side of the functional dependencies in F.

    Find L = U (N+D)

    L is a set of the attributes neither in the right side nor functionally determined by N.

    Let Li be any subset of L. Find (NLi)+ under F.

    If (NLi)+ = U then K = NLi.

    Ki, Kj such that Ki Kj, remove Kj. 90

  • 2.4. Dependencies and normalization

    Identifying all keys K of R based on a set of given functional dependencies F

    R = (A, B, C, E, G)

    F = {AB, BCE, EGA}

    All keys K ???

    91

  • 2.4. Dependencies and normalization

    R = (A, B, C, E, G)

    F = {AB, BCE, EGA}

    N = U - fFright(f) = {C, G}

    N+ = {C, G}

    D = fFright(f) - fFleft(f) =

    L = U (N+D) = {A, B, E}

    Subsets of L: L1 = {A}, L2 = {B}, L3 = {E}, L4 = {A, B}, L5 = {A, E}, L6 = {B, E}, L7 = {A, B, E}

    (NL1)+ = {C, G, A, B, E} K1 = {C, G, A}

    (NL2)+ = {C, G, B, E, A} K2 = {C, G, B}

    (NL3)+ = {C, G, E, A, B} K3 = {C, G, E} 92

  • 2.4. Dependencies and normalization

    Identifying all keys K of R based on a set of

    given functional dependencies F

    R = (A, B, C, E, G)

    F = {ABC, CGE, BG, EA}

    All keys K ???

    93

  • 2.4. Dependencies and normalization

    Membership of a functional dependency XY

    with respect to a set of functional

    dependencies F, i.e. XY is inferred from F,

    denoted by F XY

    A functional dependency XY is inferred from a set of

    dependencies F specified on R if XY holds in every

    legal relation state r of R; that is, whenever r satisfies

    all the dependencies in F, XY also holds in r. 94

  • 2.4. Dependencies and normalization

    Membership of a functional dependency XY

    with respect to a set of functional

    dependencies F, i.e. XY is inferred from F,

    denoted by F XY

    Check if F XY

    Find X+, the closure of X under F

    If Y X+ then F XY

    95

  • 2.4. Dependencies and normalization

    Membership of a functional dependency XY

    with respect to a set of functional

    dependencies F, i.e. XY is inferred from F,

    denoted by F XY

    F = {A C, AC D, E AD, E H}

    F A CD

    A+ under F = {A, C, D} {C, D} F A CD

    F E AH ??? 96

  • 2.4. Dependencies and normalization

    A set of functional dependencies F is said to

    cover another set of functional dependencies

    E if every FD in E is also in F+; that is, if every

    dependency in E can be inferred from F.

    Alternatively, E is covered by F.

    Two sets of functional dependencies E and F

    are equivalent if E+ = F+; that is, every

    dependency in E can be inferred from F and

    every dependency in F can be inferred from E;

    that is also, E covers F and F covers E. 97

  • 2.4. Dependencies and normalization

    F = {A C, AC D, E AD, E H}

    G = {A CD, E AH}

    Are F and G equivalent?

    * Check if F covers G.

    * Check if G covers F.

    98

  • 2.4. Dependencies and normalization

    The minimal cover F of a set of functional

    dependencies E is a set of functional

    dependencies that satisfies the property that

    every dependency in E is in the closure F+ of

    F. In addition, this property is lost if any

    dependency from the set F is removed; F

    must have no redundancies in it, and the

    dependencies in F are in a standard form. 99

  • 2.4. Dependencies and normalization

    The minimal cover F of a set of functional

    dependencies E

    1. Every dependency in F has a single attribute for its right-

    hand side.

    2. We cannot replace any dependency X A in F with a

    dependency Y A, where Y is a proper subset of X, and still

    have a set of dependencies that is equivalent to F.

    3. We cannot remove any dependency from F and still have a

    set of dependencies that is equivalent to F. 100

  • 2.4. Dependencies and normalization

    The minimal cover F of a set of functional dependencies E

    101

  • 2.4. Dependencies and normalization

    The minimal cover F of a set of functional

    dependencies E

    E = {A B, ABCD I, IJ G, IJ H, ACDJ GI}

    The minimal cover F of E ???

    102

  • 2.4. Dependencies and normalization

    Normalization

    A process of analyzing the given relation schemas based on their FDs and primary keys to achieve the desirable properties of:

    (1) minimizing redundancy

    (2) minimizing the insertion, deletion, and update anomalies

    Normal form of a relation

    The highest normal form condition that the relation meets the degree to which it has been normalized

    103

  • 2.4. Dependencies and normalization

    The process of normalization through

    decomposition produces the relational

    schemas that have:

    The lossless join or non-additive join property

    No spurious tuple generation problem occurs.

    The dependency preservation property

    Each FD is represented in some individual relation.

    104

  • 2.4. Dependencies and normalization

    105

    Table 10.1. Summary of normal forms based on primary keys and corresponding normalization ([1], pp. 321)

  • 2.4. Dependencies and normalization

    Pure relations based on the relational data

    model are all in 1NF.

    Relations in 2NF have non-key attributes

    functionally dependent on the whole

    primary key.

    Relations in 3NF have non-key attributes

    only functionally dependent on the whole

    primary key. 106

  • 107

    Primary key = {SSN, PNUMBER}

    Non-key attributes = HOURS, ENAME, PNAME, PLOCATION

    FD1: {SSN, PNUMBER} HOURS

    FD2: SSN ENAME

    FD3: PNUMBER {PNAME, PLOCATION}

    Functionally dependent on a part of the primary key

  • 108

    Primary key = SSN

    Non-key attributes = ENAME, BDATE, ADDRESS,

    DNUMBER, DNAME, DMGRSSN

    FD1: SSN {ENAME, BDATE, ADDRESS, DNUMBER}

    FD2: DNUMBER {DNAME, DMGRSSN} Non-key attributes are functionally dependent on another non-key attribute.

  • 2.4. Dependencies and normalization

    Relation: LOTS (PROPERTY_ID#, COUNTY_NAME, LOT#, AREA, PRICE, TAX_RATE)

    Primary key: PROPERTY_ID#

    Secondary key: {COUNTY_NAME, LOT#}

    Functional dependencies

    FD1: PROPERTY_ID# {COUNTY_NAME, LOT#, AREA, PRICE, TAX_RATE}

    FD2: {COUNTY_NAME, LOT#} {PROPERTY_ID#, AREA, PRICE, TAX_RATE}

    FD3: COUNTY_NAME TAX_RATE

    FD4: AREA PRICE 109

  • 110

    Is LOTS really in 1NF?

    Normalization is carried out as follows. Determine LOTS1, LOTS2, LOTS1A, and LOTS1B.

    Figure 10.11 (d). Summary of the progressive normalization of LOTS.

    [1], pp. 322.

  • 2.4. Dependencies and normalization

    Relation: LOTS (PROPERTY_ID#, COUNTY_NAME, LOT#, AREA, PRICE, TAX_RATE)

    Primary key: PROPERTY_ID#

    Secondary key: {COUNTY_NAME, LOT#}

    Functional dependencies

    FD1: PROPERTY_ID# {COUNTY_NAME, LOT#, AREA, PRICE, TAX_RATE}

    FD2: {COUNTY_NAME, LOT#} {PROPERTY_ID#, AREA, PRICE, TAX_RATE}

    FD3: COUNTY_NAME TAX_RATE

    FD4: AREA PRICE 111

    2NF

  • 2.4. Dependencies and normalization

    Relation: LOTS1 (PROPERTY_ID#, COUNTY_NAME, LOT#, AREA, PRICE) -- in 2NF

    Primary key: PROPERTY_ID#

    Secondary key: {COUNTY_NAME, LOT#}

    Functional dependencies

    FD1: PROPERTY_ID# {COUNTY_NAME, LOT#, AREA, PRICE}

    FD2: {COUNTY_NAME, LOT#} {PROPERTY_ID#, AREA, PRICE}

    FD4: AREA PRICE

    Relation: LOTS2 (COUNTY_NAME, TAX_RATE) -- in 3NF

    Primary key: COUNTY_NAME

    FD: COUNTY_NAME TAX_RATE

    112

    3NF

  • 2.4. Dependencies and normalization

    Relation: LOTS1A (PROPERTY_ID#, COUNTY_NAME, LOT#, AREA) -- in 3NF

    Primary key: PROPERTY_ID#

    Secondary key: {COUNTY_NAME, LOT#}

    Functional dependencies

    FD1: PROPERTY_ID# {COUNTY_NAME, LOT#, AREA}

    FD2: {COUNTY_NAME, LOT#} {PROPERTY_ID#, AREA}

    Relation: LOTS1B (AREA, PRICE) -- in 3NF

    Primary key: AREA

    FD: AREA PRICE

    Relation: LOTS2 (COUNTY_NAME, TAX_RATE) -- in 3NF

    Primary key: COUNTY_NAME

    FD: COUNTY_NAME TAX_RATE 113

  • 2.4. Dependencies and normalization

    Boyce-Codd normal form (BCNF)

    A simpler form of 3NF, but stricter than 3NF

    Every relation in BCNF is also in 3NF; however, a relation in 3NF is not necessarily in BCNF.

    For a relation in 3NF, each non-key attribute must:

    Fully functionally dependent on every key

    Nontransitively dependent on every key

    Not functionally dependent on other nonkey attributes

    For a relation in BCNF, all key/non-key attributes must be functionally dependent on superkeys.

    XA holds in R => X is a superkey of R.

    XA holds in R and X is not a superkey of R => R is not in BCNF.

    114

  • 2.4. Dependencies and normalization

    115

    Figure 10.12. (a) BCNF normalization of LOTS1A [1], pp. 325. FD5: AREA COUNTY_NAME

    AREA is not a superkey of LOTS1A. LOTS1A is in 3NF; but not in BCNF.

  • 2.4. Dependencies and normalization

    116

    Candidate key = {STUDENT, COURSE}

    FD1: {STUDENT, COURSE} INSTRUCTOR

    FD2: INSTRUCTOR COURSE

    Normalization???

  • 2.4. Dependencies and normalization

    Multivalued dependencies

    A multivalued dependency X Y specified on

    relation schema R, where X and Y are both

    subsets of R, specifies the following constraint on

    any relation state r of R: if two tuple t1 and t2

    exist in r such that t1[X] = t2[X], then two tuples

    t3 and t4 should also exist in r with the following

    properties, where Z is used to denote (R-(XUY)):

    t3[X] = t4[X] = t1[X] = t2[X]

    t3[Y] = t1[Y] and t4[Y] = t2[Y]

    t3[Z] = t2[Z] and t4[Z] = t1[Z] 117

  • 2.4. Dependencies and normalization

    Multivalued dependencies

    ENAME PNAME

    ENAME DNAME

    No association between PNAME and DNAME 118

  • 2.4. Dependencies and normalization

    Multivalued dependencies (MVDs)

    A multivalued dependency X Y is a trivial MVD

    if:

    (a) Y is a subset of X

    (b) X U Y = R

    A trivial MVD does not specify any significant or

    meaningful constraint on R.

    A nontrivial MVD satisfies neither (a) nor (b).

    119

  • 2.4. Dependencies and normalization

    Fourth normal form

    A relation schema R is in 4NF with respect to a

    set of dependencies (that includes functional

    dependencies and multivalued dependencies) if,

    for every nontrivial multivalued dependency X Y

    in F+, X is a superkey for R.

    120

  • 2.5. Conclusion

    Data modeling

    Database design process: 6 phases

    Conceptual database design

    Entity-relationship model

    Choice of a DBMS a representational data model

    Relational data model

    Data model mapping

    Functional/multivalued dependencies

    Normalization: 1NF, 2NF, 3NF, BCNF, 4NF 121

  • 122

    Questions ???

  • Review - Chapter 12 in [1] 12.1. What are the six phases of database design? Discuss each phase.

    12.2. Which of the six phases are considered the main activities of the database design process itself? Why?

    12.3. Why is it important to design the schemas and applications in parallel?

    12.4. Why is it important to use an implementation-independent data model during conceptual schema design? What models are used in current design tools? Why?

    12.5. Discuss the importance of Requirements Collection and Analysis.

    12.7. Discuss the characteristics that a data model for conceptual schema design should possess.

    12.8. Compare and contrast the two main approaches to conceptual schema design.

    12.9. Discuss the strategies for designing a single conceptual schema from its requirements.

    12.10. What are the steps of the view integration approach to conceptual schema design? What are the difficulties during each step?

    12.13. Discuss the factors that influence the choice of a DBMS package for the information system of an organization.

    12.14. What is system-independent data model mapping? How is it different from system-dependent data model mapping?

    12.15. What are the important factors that influence physical database design? 123

  • Review - Chapters 3, 4 in [1] 3.1. Discuss the role of a high-level data model in the database design process.

    3.2. List the various cases where use of a null value would be appropriate.

    3.3. Define the following terms: entity, attribute, attribute value, relationship instance, composite attribute, multivalued attribute, derived attribute, complex attribute, key attribute, value set (domain).

    3.4. What is an entity type? What is an entity set? Explain the differences among an entity, an entity type, and an entity set.

    3.6. What is a relationship type? Explain the differences among a relationship instance, a relationship type, and a relationship set.

    3.7. What is a participation role? When is it necessary to use role names in the description of relationship types?

    3.12. When is the concept of a weak entity used in data modeling? Define the terms owner entity type, weak entity type, identifying relationship type, and partial key.

    4.1. What is a subclass? When is a subclass needed in data modeling?

    Define the following terms: superclass of a subclass, superclass/subclass relationship, is-a relationship, specialization, generalization, category, specific (local) attributes) specific relationships.

    4.6. Discuss the two main types of constraints on specializations and generalizations.

    124

  • Review - Chapter 5 in [1]

    5.1. Define the following terms: domain, attribute, n-tuple,

    relation schema, relation state, degree of a relation, relational

    database schema, relational database state.

    5.2. Why are tuples in a relation not ordered?

    5.3. Why are duplicate tuples not allowed in a relation?

    5.4. What is the difference between a key and a superkey?

    5.8. Discuss the entity integrity and referential integrity

    constraints. Why is each considered important?

    5.9. Define foreign key. What is this concept used for?

    125

  • Review - Chapter 10 in [1] 10.1. Discuss attribute semantics as an informal measure of goodness for a relation schema.

    10.2. Discuss insertion, deletion, and modification anomalies. Why are they considered bad? Illustrate with examples.

    10.3. Why should nulls in a relation be avoided as far as possible? Discuss the problem of spurious tuples and how we may prevent it.

    10.4. State the informal guidelines for relation schema design that we discussed. Illustrate how violation of these guidelines may be harmful.

    10.5. What is a functional dependency? What are the possible sources of the information that defines the functional dependencies that hold among the attributes of a relation schema?

    10.6. Why can we not infer a functional dependency automatically from a particular relation state?

    10.12. What does the term unnormalized relation refer to? How did the normal forms develop historically from first normal form up to Boyce-Codd normal form?

    10.13. Define first, second, and third normal forms when only primary keys are considered. How do the general definitions of 2NF and 3NF, which consider all keys of a relation, differ from those that consider only primary keys?

    10.14. What undesirable dependencies are avoided when a relation is in 2NF?

    10.15. What undesirable dependencies are avoided when a relation is in 3NF?

    10.16. Define Boyce-Codd normal form. How does it differ from 3NF? Why is it considered a stronger form of 3NF? 126