
Database Modeling & Design

Fourth Edition


The Morgan Kaufmann Series in Data Management Systems

Series Editor: Jim Gray, Microsoft Research

Database Modeling and Design: Logical Design, Fourth Edition

Toby J. Teorey, Sam S. Lightstone, and Thomas P. Nadeau

Joe Celko’s SQL for Smarties: Advanced SQL Programming, Third Edition

Joe Celko

Moving Objects Databases

Ralf Güting and Markus Schneider

Foundations of Multidimensional and Metric Data Structures

Hanan Samet

Joe Celko’s SQL Programming Style

Joe Celko

Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

Ian Witten and Eibe Frank

Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration

Earl Cox

Data Modeling Essentials, Third Edition

Graeme C. Simsion and Graham C. Witt

Location-Based Services

Jochen Schiller and Agnès Voisard

Database Modeling with Microsoft® Visio for Enterprise Architects

Terry Halpin, Ken Evans, Patrick Hallock, Bill Maclean

Designing Data-Intensive Web Applications

Stefano Ceri, Piero Fraternali, Aldo Bongio, Marco Brambilla, Sara Comai, and Maristella Matera

Mining the Web: Discovering Knowledge from Hypertext Data

Soumen Chakrabarti

Advanced SQL:1999—Understanding Object-Relational and Other Advanced Features

Jim Melton

Database Tuning: Principles, Experiments, and Troubleshooting Techniques

Dennis Shasha and Philippe Bonnet

SQL:1999—Understanding Relational Language Components

Jim Melton and Alan R. Simon

Information Visualization in Data Mining and Knowledge Discovery

Edited by Usama Fayyad, Georges G. Grinstein, and Andreas Wierse

Transactional Information Systems: Theory, Algorithms, and Practice of Concurrency Control and Recovery

Gerhard Weikum and Gottfried Vossen

Spatial Databases: With Application to GIS

Philippe Rigaux, Michel Scholl, and Agnès Voisard

Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design

Terry Halpin

Component Database Systems

Edited by Klaus R. Dittrich and Andreas Geppert

Managing Reference Data in Enterprise Databases: Binding Corporate Data to the Wider World

Malcolm Chisholm

Data Mining: Concepts and Techniques

Jiawei Han and Micheline Kamber

Understanding SQL and Java Together: A Guide to SQLJ, JDBC, and Related Technologies

Jim Melton and Andrew Eisenberg

Database: Principles, Programming, and Performance, Second Edition

Patrick and Elizabeth O'Neil

The Object Data Standard: ODMG 3.0

Edited by R. G. G. Cattell and Douglas K. Barry

Data on the Web: From Relations to Semistructured Data and XML

Serge Abiteboul, Peter Buneman, and Dan Suciu

Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations

Ian Witten and Eibe Frank


Joe Celko’s Data and Databases: Concepts in Practice

Joe Celko

Developing Time-Oriented Database Applications in SQL

Richard T. Snodgrass

Web Farming for the Data Warehouse

Richard D. Hackathorn

Management of Heterogeneous and Autonomous Database Systems

Edited by Ahmed Elmagarmid, Marek Rusinkiewicz, and Amit Sheth

Object-Relational DBMSs: Tracking the Next Great Wave, Second Edition

Michael Stonebraker and Paul Brown, with Dorothy Moore

A Complete Guide to DB2 Universal Database

Don Chamberlin

Universal Database Management: A Guide to Object/Relational Technology

Cynthia Maro Saracco

Readings in Database Systems, Third Edition

Edited by Michael Stonebraker and Joseph M. Hellerstein

Understanding SQL’s Stored Procedures: A Complete Guide to SQL/PSM

Jim Melton

Principles of Multimedia Database Systems

V. S. Subrahmanian

Principles of Database Query Processing for Advanced Applications

Clement T. Yu and Weiyi Meng

Advanced Database Systems

Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T. Snodgrass, V. S. Subrahmanian, and Roberto Zicari

Principles of Transaction Processing

Philip A. Bernstein and Eric Newcomer

Using the New DB2: IBM’s Object-Relational Database System

Don Chamberlin

Distributed Algorithms

Nancy A. Lynch

Active Database Systems: Triggers and Rules For Advanced Database Processing

Edited by Jennifer Widom and Stefano Ceri

Migrating Legacy Systems: Gateways, Interfaces, & the Incremental Approach

Michael L. Brodie and Michael Stonebraker

Atomic Transactions

Nancy Lynch, Michael Merritt, William Weihl, and Alan Fekete

Query Processing for Advanced Database Systems

Edited by Johann Christoph Freytag, David Maier, and Gottfried Vossen

Transaction Processing: Concepts and Techniques

Jim Gray and Andreas Reuter

Building an Object-Oriented Database System: The Story of O2

Edited by François Bancilhon, Claude Delobel, and Paris Kanellakis

Database Transaction Models for Advanced Applications

Edited by Ahmed K. Elmagarmid

A Guide to Developing Client/Server SQL Applications

Setrag Khoshafian, Arvola Chan, Anna Wong, and Harry K. T. Wong

The Benchmark Handbook for Database and Transaction Processing Systems, Second Edition

Edited by Jim Gray

Camelot and Avalon: A Distributed Transaction Facility

Edited by Jeffrey L. Eppinger, Lily B. Mummert, and Alfred Z. Spector

Readings in Object-Oriented Database Systems

Edited by Stanley B. Zdonik and David Maier


Database Modeling & Design:

Logical Design

Fourth Edition

TOBY TEOREY
SAM LIGHTSTONE

TOM NADEAU

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

MORGAN KAUFMANN PUBLISHERS IS AN IMPRINT OF ELSEVIER


Publishing Director: Michael Forster
Publisher: Diane Cerra
Publishing Services Manager: Simon Crump
Editorial Assistant: Asma Stephan
Cover Design: Yvo Riezebos Design
Cover Image: Getty Images
Composition: Multiscience Press, Inc.
Technical Illustration: Dartmouth Publishing, Inc.
Copyeditor: Multiscience Press, Inc.
Proofreader: Multiscience Press, Inc.
Indexer: Multiscience Press, Inc.
Interior printer: Maple-Vail Book Manufacturing Group
Cover printer: Phoenix Color

Morgan Kaufmann Publishers is an imprint of Elsevier. 500 Sansome Street, Suite 400, San Francisco, CA 94111

This book is printed on acid-free paper.

© 2006 by Elsevier Inc. All rights reserved.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, scanning, or otherwise—without prior written permission of the publisher.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: [email protected]. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com) by selecting “Customer Support” and then “Obtaining Permissions.”

Library of Congress Cataloging-in-Publication Data
Application submitted.

ISBN 13: 978-0-12-685352-0
ISBN 10: 0-12-685352-5

For information on all Morgan Kaufmann publications, visit our Web site at www.mkp.com or www.books.elsevier.com

Printed in the United States of America
05 06 07 08 09    5 4 3 2 1


To Matt, Carol, and Marilyn (Toby Teorey)

To my wife and children, Elisheva, Hodaya, and Avishai (Sam Lightstone)

To Carol, Paula, Mike, and Lagi (Tom Nadeau)


Contents

Preface

Chapter 1 Introduction
1.1 Data and Database Management
1.2 The Database Life Cycle
1.3 Conceptual Data Modeling
1.4 Summary
1.5 Literature Summary

Chapter 2 The Entity-Relationship Model
2.1 Fundamental ER Constructs
2.1.1 Basic Objects: Entities, Relationships, Attributes
2.1.2 Degree of a Relationship
2.1.3 Connectivity of a Relationship
2.1.4 Attributes of a Relationship
2.1.5 Existence of an Entity in a Relationship
2.1.6 Alternative Conceptual Data Modeling Notations
2.2 Advanced ER Constructs
2.2.1 Generalization: Supertypes and Subtypes
2.2.2 Aggregation
2.2.3 Ternary Relationships
2.2.4 General n-ary Relationships
2.2.5 Exclusion Constraint
2.2.6 Referential Integrity
2.3 Summary
2.4 Literature Summary

Chapter 3 The Unified Modeling Language (UML)
3.1 Class Diagrams
3.1.1 Basic Class Diagram Notation
3.1.2 Class Diagrams for Database Design
3.1.3 Example from the Music Industry
3.2 Activity Diagrams
3.2.1 Activity Diagram Notation Description
3.2.2 Activity Diagrams for Workflow
3.3 Rules of Thumb for UML Usage
3.4 Summary
3.5 Literature Summary

Chapter 4 Requirements Analysis and Conceptual Data Modeling
4.1 Introduction
4.2 Requirements Analysis
4.3 Conceptual Data Modeling
4.3.1 Classify Entities and Attributes
4.3.2 Identify the Generalization Hierarchies
4.3.3 Define Relationships
4.3.4 Example of Data Modeling: Company Personnel and Project Database
4.4 View Integration
4.4.1 Preintegration Analysis
4.4.2 Comparison of Schemas
4.4.3 Conformation of Schemas
4.4.4 Merging and Restructuring of Schemas
4.4.5 Example of View Integration
4.5 Entity Clustering for ER Models
4.5.1 Clustering Concepts
4.5.2 Grouping Operations
4.5.3 Clustering Technique
4.6 Summary
4.7 Literature Summary

Chapter 5 Transforming the Conceptual Data Model to SQL
5.1 Transformation Rules and SQL Constructs
5.1.1 Binary Relationships
5.1.2 Binary Recursive Relationships
5.1.3 Ternary and n-ary Relationships
5.1.4 Generalization and Aggregation
5.1.5 Multiple Relationships
5.1.6 Weak Entities
5.2 Transformation Steps
5.2.1 Entity Transformation
5.2.2 Many-to-Many Binary Relationship Transformation
5.2.3 Ternary Relationship Transformation
5.2.4 Example of ER-to-SQL Transformation
5.3 Summary
5.4 Literature Summary

Chapter 6 Normalization
6.1 Fundamentals of Normalization
6.1.1 First Normal Form
6.1.2 Superkeys, Candidate Keys, and Primary Keys
6.1.3 Second Normal Form
6.1.4 Third Normal Form
6.1.5 Boyce-Codd Normal Form
6.2 The Design of Normalized Tables: A Simple Example
6.3 Normalization of Candidate Tables Derived from ER Diagrams
6.4 Determining the Minimum Set of 3NF Tables
6.5 Fourth and Fifth Normal Forms
6.5.1 Multivalued Dependencies
6.5.2 Fourth Normal Form
6.5.3 Decomposing Tables to 4NF
6.5.4 Fifth Normal Form
6.6 Summary
6.7 Literature Summary

Chapter 7 An Example of Logical Database Design
7.1 Requirements Specification
7.1.1 Design Problems
7.2 Logical Design
7.3 Summary

Chapter 8 Business Intelligence
8.1 Data Warehousing
8.1.1 Overview of Data Warehousing
8.1.2 Logical Design
8.2 Online Analytical Processing (OLAP)
8.2.1 The Exponential Explosion of Views
8.2.2 Overview of OLAP
8.2.3 View Size Estimation
8.2.4 Selection of Materialized Views
8.2.5 View Maintenance
8.2.6 Query Optimization
8.3 Data Mining
8.3.1 Forecasting
8.3.2 Text Mining
8.4 Summary
8.5 Literature Summary

Chapter 9 CASE Tools for Logical Database Design
9.1 Introduction to the CASE Tools
9.2 Key Capabilities to Watch For
9.3 The Basics
9.4 Generating a Database from a Design
9.5 Database Support
9.6 Collaborative Support
9.7 Distributed Development
9.8 Application Life Cycle Tooling Integration
9.9 Design Compliance Checking
9.10 Reporting
9.11 Modeling a Data Warehouse
9.12 Semi-Structured Data, XML
9.13 Summary
9.14 Literature Summary

Appendix: The Basics of SQL

Glossary

References

Exercises

Solutions to Selected Exercises

About the Authors

Index


1 Introduction

Database technology has evolved rapidly in the three decades since the rise and eventual dominance of relational database systems. While many specialized database systems (object-oriented, spatial, multimedia, etc.) have found substantial user communities in the science and engineering fields, relational systems remain the dominant database technology for business enterprises.

Relational database design has evolved from an art to a science that has been made partially implementable as a set of software design aids. Many of these design aids have appeared as the database component of computer-aided software engineering (CASE) tools, and many of them offer interactive modeling capability using a simplified data modeling approach. Logical design—that is, the structure of basic data relationships and their definition in a particular database system—is largely the domain of application designers. These designers can work effectively with tools such as ERwin Data Modeler or Rational Rose with UML, as well as with a purely manual approach. Physical design, the creation of efficient data storage and retrieval mechanisms on the computing platform being used, is typically the domain of the database administrator (DBA). Today’s DBAs have a variety of vendor-supplied tools available to help design the most efficient databases. This book is devoted to the logical design methodologies and tools most popular for relational databases today. Physical design methodologies and tools are covered in a separate book.


In this chapter, we review the basic concepts of database management and introduce the role of data modeling and database design in the database life cycle.

1.1 Data and Database Management

The basic component of a file in a file system is a data item, which is the smallest named unit of data that has meaning in the real world—for example, last name, first name, street address, ID number, or political party. A group of related data items treated as a single unit by an application is called a record. Examples of types of records are order, salesperson, customer, product, and department. A file is a collection of records of a single type. Database systems have built upon and expanded these definitions: in a relational database, a data item is called a column or attribute; a record is called a row or tuple; and a file is called a table.
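
To ground the terminology, here is a minimal SQL sketch; the salesperson table and its columns are invented for illustration. Each column corresponds to a data item (attribute), each inserted row to a record (tuple), and the table itself to a file.

-- Table = file; each column = data item (attribute); each row = record (tuple).
create table salesperson
  (last_name  char(15),
   first_name char(15),
   dept       char(10));

insert into salesperson values ('Smith', 'Jane', 'Sales');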

A database is a more complex object; it is a collection of interrelated stored data that serves the needs of multiple users within one or more organizations, that is, interrelated collections of many different types of tables. The motivations for using databases rather than files include greater availability to a diverse set of users, integration of data for easier access to and updating of complex transactions, and less redundancy of data.

A database management system (DBMS) is a generalized software system for manipulating databases. A DBMS supports a logical view (schema, subschema); physical view (access methods, data clustering); data definition language; data manipulation language; and important utilities, such as transaction management and concurrency control, data integrity, crash recovery, and security. Relational database systems, the dominant type of systems for well-formatted business databases, also provide a greater degree of data independence than the earlier hierarchical and network (CODASYL) database management systems. Data independence is the ability to make changes in either the logical or physical structure of the database without requiring reprogramming of application programs. It also makes database conversion and reorganization much easier. Relational DBMSs provide a much higher degree of data independence than previous systems; they are the focus of our discussion on data modeling.


1.2 The Database Life Cycle

The database life cycle incorporates the basic steps involved in designing a global schema of the logical database, allocating data across a computer network, and defining local DBMS-specific schemas. Once the design is completed, the life cycle continues with database implementation and maintenance. This chapter contains an overview of the database life cycle, as shown in Figure 1.1. In succeeding chapters, we will focus on the database design process from the modeling of requirements through logical design (steps I and II below). The result of each step of the life cycle is illustrated with a series of diagrams in Figure 1.2. Each diagram shows a possible form of the output of each step, so the reader can see the progression of the design process from an idea to actual database implementation. These forms are discussed in much more detail in Chapters 2 through 6.

I. Requirements analysis. The database requirements are determined by interviewing both the producers and users of data and using the information to produce a formal requirements specification. That specification includes the data required for processing, the natural data relationships, and the software platform for the database implementation. As an example, Figure 1.2 (step I) shows the concepts of products, customers, salespersons, and orders being formulated in the mind of the end user during the interview process.

II. Logical design. The global schema, a conceptual data model diagram that shows all the data and their relationships, is developed using techniques such as ER or UML. The data model constructs must ultimately be transformed into normalized (global) relations, or tables. The global schema development methodology is the same for either a distributed or centralized database.

a. Conceptual data modeling. The data requirements are analyzed and modeled using an ER or UML diagram that includes, for example, semantics for optional relationships, ternary relationships, supertypes, and subtypes (categories). Processing requirements are typically specified using natural language expressions or SQL commands, along with the frequency of occurrence. Figure 1.2 [step II(a)] shows a possible ER model representation of the product/customer database in the mind of the end user.

Figure 1.1 The database life cycle [Flowchart: determine requirements; model information requirements; integrate views when there are multiple views (else proceed directly); transform to SQL tables; normalize (these steps form logical design); select indexes and, for special requirements, denormalize (physical design); implement; then monitor and detect changing requirements, looping back to earlier steps until the design is defunct (implementation).]


b. View integration. Usually, when the design is large and more than one person is involved in requirements analysis, multiple views of data and relationships result. To eliminate redundancy and inconsistency from the model, these views must eventually be “rationalized” (resolving inconsistencies due to variance in taxonomy, context, or perception) and then consolidated into a single global view. View integration requires the use of ER semantic tools such as identification of synonyms, aggregation, and generalization. In Figure 1.2 [step II(b)], two possible views of the product/customer database are merged into a single global view based on common data for customer and order. View integration is also important for application integration.

Figure 1.2 Life cycle results, step-by-step [Figure: Step I, requirements analysis (reality): products, customers, salespersons, orders. Step II(a), conceptual data modeling: separate ER views for the retail salesperson (customer served-by salesperson; salesperson orders product, sold-by) and the customer (customer places order; customer served-by salesperson; salesperson fills-out order for product). Step II(b), view integration: the retail salesperson’s and customer’s views merged, e.g., customer (1) places order (N).]

c. Transformation of the conceptual data model to SQL tables. Based on a categorization of data modeling constructs and a set of mapping rules, each relationship and its associated entities are transformed into a set of DBMS-specific candidate relational tables. We will show these transformations in standard SQL in Chapter 5. Redundant tables are eliminated as part of this process. In our example, the tables in step II(c) of Figure 1.2 are the result of transformation of the integrated ER model in step II(b).

d. Normalization of tables. Functional dependencies (FDs) are derived from the conceptual data model diagram and the semantics of data relationships in the requirements analysis. They represent the dependencies among data elements that are unique identifiers (keys) of entities. Additional FDs that represent the dependencies among key and nonkey attributes within entities can be derived from the requirements specification. Candidate relational tables associated with all derived FDs are normalized (i.e., modified by decomposing or splitting tables into smaller tables) using standard techniques. Finally, redundancies in the data in normalized candidate tables are analyzed further for possible elimination, with the constraint that data integrity must be preserved. An example of normalization of the Salesperson table into the new Salesperson and SalesVacations tables is shown in Figure 1.2 from step II(c) to step II(d).
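
In SQL terms, the Salesperson decomposition of Figure 1.2 might look like the following minimal sketch; the column names follow the figure, but the data types are assumptions:

-- Candidate table: vacation_days depends only on job_level
-- (sales_name -> job_level -> vacation_days is a transitive dependency).
create table salesperson_candidate
  (sales_name    char(15) primary key,
   addr          char(30),
   dept          char(10),
   job_level     integer,
   vacation_days integer);

-- Normalized: the job_level -> vacation_days dependency gets its own table.
create table salesperson
  (sales_name char(15) primary key,
   addr       char(30),
   dept       char(10),
   job_level  integer);

create table sales_vacations
  (job_level     integer primary key,
   vacation_days integer);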

We note here that database tool vendors tend to use the term logical model to refer to the conceptual data model, and they use the term physical model to refer to the DBMS-specific implementation model (e.g., SQL tables). Note also that many conceptual data models are obtained not from scratch, but from the process of reverse engineering from an existing DBMS-specific schema [Silberschatz, Korth, and Sudarshan, 2002].

III. Physical design. The physical design step involves the selection of indexes (access methods), partitioning, and clustering of data. The logical design methodology in step II simplifies the approach to designing large relational databases by reducing the number of data dependencies that need to be analyzed. This is accomplished by inserting conceptual data modeling and integration steps [steps II(a) and II(b) of Figure 1.2] into the traditional relational design approach. The objective of these steps is an accurate representation of reality. Data integrity is preserved through normalization of the candidate tables created when the conceptual data model is transformed into a relational model. The purpose of physical design is to optimize performance as closely as possible.

Figure 1.2 (continued) [Figure: Step II(c), transformation of the conceptual model to SQL tables: Customer (cust-no, cust-name, ...), Product (prod-no, prod-name, qty-in-stock), Order (order-no, sales-name, cust-no), Order-product (order-no, prod-no), and Salesperson (sales-name, addr, dept, job-level, vacation-days), with SQL such as the create table statement below. Step II(d), normalization of SQL tables: decomposition of tables and removal of update anomalies, splitting Salesperson into Salesperson and Sales-vacations. Step III, physical design: indexing, clustering, partitioning, materialized views, denormalization.]

create table customer
  (cust_no integer,
   cust_name char(15),
   cust_addr char(30),
   sales_name char(15),
   prod_no integer,
   primary key (cust_no),
   foreign key (sales_name) references salesperson,
   foreign key (prod_no) references product);

As part of the physical design, the global schema can sometimes be refined in limited ways to reflect processing (query and transaction) requirements if there are obvious, large gains to be made in efficiency. This is called denormalization. It consists of selecting dominant processes on the basis of high frequency, high volume, or explicit priority; defining simple extensions to tables that will improve query performance; evaluating total cost for query, update, and storage; and considering the side effects, such as possible loss of integrity. This is particularly important for Online Analytical Processing (OLAP) applications.
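
As a hypothetical illustration of such a simple table extension (tables and columns invented for this sketch, with "orders" standing in for the figure's Order table), a dominant query that joins orders to customers only to fetch the customer name might justify carrying a redundant copy of that name:

create table customer
  (cust_no   integer primary key,
   cust_name char(15));

create table orders
  (order_no integer primary key,
   cust_no  integer references customer);

-- Normalized access: fetching the customer name requires a join.
select o.order_no, c.cust_name
from orders o join customer c on o.cust_no = c.cust_no;

-- Denormalized extension: a redundant copy of cust_name in orders avoids
-- the join for the dominant query, at the cost of extra storage and of
-- update logic to keep the copy consistent (a possible loss of integrity).
alter table orders add column cust_name char(15);

select order_no, cust_name from orders;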

IV. Database implementation, monitoring, and modification. Once the design is completed, the database can be created through implementation of the formal schema using the data definition language (DDL) of a DBMS. Then the data manipulation language (DML) can be used to query and update the database, as well as to set up indexes and establish constraints, such as referential integrity. The language SQL contains both DDL and DML constructs; for example, the create table command represents DDL, and the select command represents DML.
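
For example, a minimal sketch of both kinds of statements (the table and columns are assumed for illustration):

-- DDL: defines a schema object.
create table product
  (prod_no   integer primary key,
   prod_name char(15));

-- DML: queries the data.
select prod_name
from product
where prod_no = 100;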

As the database begins operation, monitoring indicates whether performance requirements are being met. If they are not being satisfied, modifications should be made to improve performance. Other modifications may be necessary when requirements change or when the end users’ expectations increase with good performance. Thus, the life cycle continues with monitoring, redesign, and modifications. In the next two chapters we look first at the basic data modeling concepts and then—starting in Chapter 4—we apply these concepts to the database design process.

1.3 Conceptual Data Modeling

Conceptual data modeling is the driving component of logical database design. Let us take a look at how this component came about, and why it is important. Schema diagrams were formalized in the 1960s by Charles Bachman. He used rectangles to denote record types and directed arrows from one record type to another to denote a one-to-many relationship among instances of records of the two types. The entity-relationship (ER) approach for conceptual data modeling, one of the two approaches emphasized in this book and described in detail in Chapter 2, was first presented in 1976 by Peter Chen. The Chen form of the ER model uses rectangles to specify entities, which are somewhat analogous to records. It also uses diamond-shaped objects to represent the various types of relationships, which are differentiated by numbers or letters placed on the lines connecting the diamonds to the rectangles.

The Unified Modeling Language (UML) was introduced in 1997 by Grady Booch and James Rumbaugh and has become a standard graphical language for specifying and documenting large-scale software systems. The data modeling component of UML (now UML-2) has a great deal of similarity with the ER model and will be presented in detail in Chapter 3. We will use both the ER model and UML to illustrate the data modeling and logical database design examples throughout this book.

In conceptual data modeling, the overriding emphasis is on simplicity and readability. The goal of conceptual schema design, where the ER and UML approaches are most useful, is to capture real-world data requirements in a simple and meaningful way that is understandable by both the database designer and the end user. The end user is the person responsible for accessing the database and executing queries and updates through the use of DBMS software, and therefore has a vested interest in the database design process.

The ER model has two levels of definition—one that is quite simple and another that is considerably more complex. The simple level is the one used by most current design tools. It is quite helpful to the database designer who must communicate with end users about their data requirements. At this level you simply describe, in diagram form, the entities, attributes, and relationships that occur in the system to be conceptualized, using semantics that are definable in a data dictionary. Specialized constructs, such as “weak” entities or mandatory/optional existence notation, are also usually included in the simple form. But very little else is included, to avoid cluttering up the ER diagram while the designer’s and end user’s understandings of the model are being reconciled.

An example of a simple form of ER model using the Chen notation is shown in Figure 1.3. In this example, we want to keep track of videotapes and customers in a video store. Videos and customers are represented as entities Video and Customer, and the relationship “rents” shows a many-to-many association between them. Both Video and Customer entities have a few attributes that describe their characteristics, and the relationship “rents” has an attribute due date that represents the date that a particular video rented by a specific customer must be returned.

From the database practitioner’s standpoint, the simple form of the ER model (or UML) is the preferred form for both data modeling and end user verification. It is easy to learn and applicable to a wide variety of design problems that might be encountered in industry and small businesses. As we will demonstrate, the simple form can be easily translated into SQL data definitions, and thus it has an immediate use as an aid for database implementation.
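
As a preview of that translation (the rules are given in Chapter 5), a minimal SQL sketch of the Figure 1.3 model might read as follows; the data types and lengths are assumptions:

create table customer
  (cust_id   integer primary key,
   cust_name char(15));

create table video
  (video_id integer,
   copy_no  integer,
   title    char(30),
   primary key (video_id, copy_no));

-- The many-to-many "rents" relationship becomes its own table,
-- carrying the relationship attribute due_date.
create table rents
  (cust_id  integer references customer,
   video_id integer,
   copy_no  integer,
   due_date date,
   primary key (cust_id, video_id, copy_no),
   foreign key (video_id, copy_no) references video);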

The complex level of ER model definition includes concepts that go well beyond the simple model. It includes concepts from the semantic models of artificial intelligence and from competing conceptual data models. Data modeling at this level helps the database designer capture more semantics without having to resort to narrative explanations. It is also useful to the database application programmer, because certain integrity constraints defined in the ER model relate directly to code—code that checks range limits on data values and null values, for example. However, such detail in very large data model diagrams actually detracts from end user understanding. Therefore, the simple level is recommended as the basic communication tool for database design verification.

Figure 1.3 A simple form of ER model using the Chen notation [Figure: entities Customer (cust-id, cust-name) and Video (video-id, copy-no, title) connected by the many-to-many (N:N) relationship “rents,” which carries the attribute due-date.]


1.4 Summary

Knowledge of data modeling and database design techniques is important for database practitioners and application developers. The database life cycle shows the steps needed in a methodical approach to designing a database, from logical design, which is independent of the system environment, to physical design, which is based on the details of the database management system chosen to implement the database. Among the variety of data modeling approaches, the ER and UML data models are arguably the most popular ones in use today, due to their simplicity and readability. A simple form of these models is used in most design tools; it is easy to learn and to apply to a variety of industrial and business applications. It is also a very useful tool for communicating with the end user about the conceptual model and for verifying the assumptions made in the modeling process. A more complex form, a superset of the simple form, is useful for the more experienced designer who wants to capture greater semantic detail in diagram form, while avoiding having to write long and tedious narrative to explain certain requirements and constraints.

1.5 Literature Summary

Much of the early data modeling work was done by Bachman [1969, 1972], Chen [1976], Senko et al. [1973], and others. Database design textbooks that adhere to a significant portion of the relational database life cycle described in this chapter are Teorey and Fry [1982], Muller [1999], Stephens and Plew [2000], Simsion and Witt [2001], and Hernandez and Getz [2003]. Temporal (time-varying) databases are defined and discussed in Jensen and Snodgrass [1996] and Snodgrass [2000]. Other well-used approaches for conceptual data modeling include IDEF1X [Bruce, 1992; IDEF1X, 2005] and the data modeling component of the Zachman Framework [Zachman, 1987; Zachman Institute for Framework Advancement, 2005]. Schema evolution during development, a frequently occurring problem, is addressed in Harriman, Hodgetts, and Leo [2004].


2 The Entity-Relationship Model

This chapter defines all the major entity-relationship (ER) concepts that can be applied to the conceptual data modeling phase of the database life cycle. In Section 2.1, we will look at the simple level of ER modeling described in the original work by Chen and extended by others. The simple form of the ER model is used as the basis for effective communication with the end user about the conceptual database. Section 2.2 presents the more advanced concepts; although they are less generally accepted, they are useful to describe certain semantics that cannot be constructed with the simple model.

2.1 Fundamental ER Constructs

2.1.1 Basic Objects: Entities, Relationships, Attributes

The basic ER model consists of three classes of objects: entities, relationships, and attributes.

Entities

Entities are the principal data objects about which information is to be collected; they usually denote a person, place, thing, or event of informational interest. A particular occurrence of an entity is called an entity instance or sometimes an entity occurrence. In our example, employee, department, division, project, skill, and location are all examples of entities. For easy reference, entity names will henceforth be capitalized throughout this text (e.g., Employee, Department, and so forth). The entity construct is a rectangle as depicted in Figure 2.1. The entity name is written inside the rectangle.

Relationships

Relationships represent real-world associations among one or more entities, and, as such, have no physical or conceptual existence other than that which depends upon their entity associations. Relationships are described in terms of degree, connectivity, and existence. These terms are defined in the sections that follow. The most common meaning associated with the term relationship is indicated by the connectivity between entity occurrences: one-to-one, one-to-many, and many-to-many. The relationship construct is a diamond that connects the associated entities, as shown in Figure 2.1. The relationship name can be written inside or just outside the diamond.

Figure 2.1 The basic ER model [Figure: constructs and examples for entity (rectangle, e.g., Employee), relationship (diamond, e.g., works-in), attribute (identifier/key such as emp-id, descriptor/nonkey such as emp-name, multivalued descriptor such as degrees, and complex attribute such as address, subdivided into street, city, state, and zip-code), and weak entity (double-bordered rectangle, e.g., Employee-job-history).]

A role is the name of one end of a relationship when each end needs a distinct name for clarity of the relationship. In most of the examples given in Figure 2.2, role names are not required because the entity names combined with the relationship name clearly define the individual roles of each entity in the relationship. However, in some cases role names should be used to clarify ambiguities. For example, in the first case in Figure 2.2, the recursive binary relationship “manages” uses two roles, “manager” and “subordinate,” to associate the proper connectivities with the two different roles of the single entity. Role names are typically nouns. In this diagram, one employee’s role is to be the “manager” of up to n other employees. The other role is for particular “subordinates” to be managed by exactly one other employee.
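
Looking ahead to the relational transformation of Chapter 5, a recursive one-to-many relationship such as “manages” is commonly carried as a self-referencing foreign key; this is a hedged sketch with assumed column names:

create table employee
  (emp_id   integer primary key,
   emp_name char(15),
   -- Subordinate role: each employee reports to at most one manager;
   -- null marks the top of the management hierarchy.
   mgr_id   integer references employee);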

Attributes

Attributes are characteristics of entities that provide descriptive detail about them. A particular occurrence of an attribute within an entity or relationship is called an attribute value. Attributes of an entity such as Employee may include emp-id, emp-name, emp-address, phone-no, fax-no, job-title, and so on. The attribute construct is an ellipse with the attribute name inside (or an oblong, as shown in Figure 2.1). The attribute is connected to the entity it characterizes.

There are two types of attributes: identifiers and descriptors. An identifier (or key) is used to uniquely determine an instance of an entity; a descriptor (or nonkey attribute) is used to specify a nonunique characteristic of a particular entity instance. Both identifiers and descriptors may consist of either a single attribute or some composite of attributes. For example, an identifier or key of Employee is emp-id, and a descriptor of Employee is emp-name or job-title. Key attributes are underlined in the ER diagram, as shown in Figure 2.1. We note, briefly, that you can have more than one identifier (key) for an entity, or you can have a set of attributes that compose a key (see Section 6.1.2).

Some attributes, such as specialty-area, may be multivalued. The notation for multivalued attributes is a double attachment line, as shown in Figure 2.1. Other attributes may be complex, such as an address that further subdivides into street, city, state, and zip code. Complex attributes are constructed to have attributes of their own; sometimes, however, the individual parts of a complex attribute are specified as individual attributes in the first place. Either form is reasonable in ER notation.


Entities have internal identifiers that uniquely determine the existence of entity instances, but weak entities derive their identity from the identifying attributes of one or more “parent” entities. Weak entities are often depicted with a double-bordered rectangle (see Figure 2.1), which denotes that all occurrences of that entity depend on an associated (strong) entity for their existence in the database. For example, in Figure 2.1, the weak entity Employee-job-history is related to the entity Employee and dependent upon Employee for its own existence.
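
In relational terms (again anticipating Chapter 5), a weak entity typically receives a composite key built from the parent's identifier plus a distinguishing attribute of its own; the start_date and position columns here are invented for illustration:

create table employee
  (emp_id   integer primary key,
   emp_name char(15));

-- Weak entity: its key borrows emp_id from the strong entity Employee.
create table employee_job_history
  (emp_id     integer references employee,
   start_date date,
   position   char(15),
   primary key (emp_id, start_date));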

2.1.2 Degree of a Relationship

The degree of a relationship is the number of entities associated in the relationship. Binary and ternary relationships are special cases where the degrees are 2 and 3, respectively. An n-ary relationship is the general form for any degree n. The notation for degree is illustrated in Figure 2.2. The binary relationship, an association between two entities, is by far the most common type in the natural world. In fact, many modeling systems use only this type. In Figure 2.2, we see many examples of the association of two entities in different ways: Department and Division, Department and Employee, Employee and Project, and so on. A binary recursive relationship (for example, “manages” in Figure 2.2) relates a particular Employee to another Employee by management. It is called recursive because the entity relates only to another instance of its own type. The binary recursive relationship construct is a diamond with both connections to the same entity.

A ternary relationship is an association among three entities. This type of relationship is required when binary relationships are not sufficient to accurately describe the semantics of the association. The ternary relationship construct is a single diamond connected to three entities, as shown in Figure 2.2. Sometimes a relationship is mistakenly modeled as ternary when it could be decomposed into two or three equivalent binary relationships. When this occurs, the ternary relationship should be eliminated to achieve both simplicity and semantic purity. Ternary relationships are discussed in greater detail in Section 2.2.3 and Chapter 6.

An entity may be involved in any number of relationships, and each relationship may be of any degree. Furthermore, two entities may have any number of binary relationships between them, and so on for any n entities (see n-ary relationships defined in Section 2.2.4).


Figure 2.2 Degrees, connectivity, and attributes of a relationship [Figure: examples of degree (binary: Department is-subunit-of Division; binary recursive: Employee manages Employee, with roles manager and subordinate; ternary: Employee, Skill, and Project joined by uses), connectivity (one-to-one: Department is-managed-by Employee; one-to-many: Department has Employee; many-to-many: Employee works-on Project, with relationship attributes task-assignment and start-date), and existence (optional and mandatory, illustrated with Department is-managed-by Employee and Office is-occupied-by Employee).]

2.1.3 Connectivity of a Relationship

The connectivity of a relationship describes a constraint on the connection of the associated entity occurrences in the relationship. Values for connectivity are either “one” or “many.” For a relationship between the entities Department and Employee, a connectivity of one for Department and many for Employee means that there is at most one entity occurrence of Department associated with many occurrences of Employee. The actual count of elements associated with the connectivity is called the cardinality of the relationship connectivity; it is used much less frequently than the connectivity constraint because the actual values are usually variable across instances of relationships. Note that there are no standard terms for the connectivity concept, so the reader is admonished to consider the definition of these terms carefully when using a particular database design methodology.

Figure 2.2 shows the basic constructs for connectivity for binary relationships: one-to-one, one-to-many, and many-to-many. On the “one” side, the number one is shown on the connection between the relationship and one of the entities, and on the “many” side, the letter N is used on the connection between the relationship and the entity to designate the concept of many.

In the one-to-one case, the entity Department is managed by exactly one Employee, and each Employee manages exactly one Department. Therefore, the minimum and maximum connectivities on the “is-managed-by” relationship are exactly one for both Department and Employee.

In the one-to-many case, the entity Department is associated with (“has”) many Employees. The maximum connectivity is given on the Employee (many) side as the unknown value N, but the minimum connectivity is known as one. On the Department side the minimum and maximum connectivities are both one, that is, each Employee works within exactly one Department.

In the many-to-many case, a particular Employee may work on many Projects and each Project may have many Employees. We see that the maximum connectivity for Employee and Project is N in both directions, and the minimum connectivities are each defined (implied) as one.
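
To see how these connectivities surface in SQL (Chapter 5 formalizes the rules), the one-to-many “has” relationship is typically absorbed as a foreign key on the “many” side; the table and column names here are assumptions:

create table department
  (dept_no   integer primary key,
   dept_name char(15));

-- One-to-many: each employee references exactly one department,
-- while a department may be referenced by many employees.
create table employee
  (emp_id   integer primary key,
   emp_name char(15),
   dept_no  integer not null references department);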

Some situations, though rare, are such that the actual maximum connectivity is known. For example, a professional basketball team may be limited by conference rules to 12 players. In such a case, the number 12 could be placed next to an entity called “team members” on the many side of a relationship with an entity “team.” Most situations, however, have variable connectivity on the many side, as shown in all the examples of Figure 2.2.

2.1.4 Attributes of a Relationship

Attributes can be assigned to certain types of relationships as well as to entities. An attribute of a many-to-many relationship, such as the “works-on” relationship between the entities Employee and Project (Figure 2.2), could be “task-assignment” or “start-date.” In this case, a given task assignment or start date only has meaning when it is common to an instance of the assignment of a particular Employee to a particular Project via the relationship “works-on.”

Attributes of relationships are typically assigned only to binary many-to-many relationships and to ternary relationships. They are not normally assigned to one-to-one or one-to-many relationships, because of potential ambiguities. For example, in the one-to-one binary relationship “is-managed-by” between Department and Employee, an attribute “start-date” could be applied to Department to designate the start date for that department. Alternatively, it could be applied to Employee as an attribute for each Employee instance, to designate the employee’s start date as the manager of that department. If, instead, the relationship is many-to-many, so that an employee can manage many departments over time, then the attribute “start-date” must shift to the relationship, so each instance of the relationship that matches one employee with one department can have a unique start date for that employee as manager of that department.
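
Concretely, when “works-on” is transformed to SQL (Chapter 5), its attributes land in the relationship's own table, the only place where a particular employee/project pairing is unambiguous; the data types in this sketch are assumptions:

create table employee
  (emp_id   integer primary key,
   emp_name char(15));

create table project
  (proj_no   integer primary key,
   proj_name char(15));

-- The many-to-many relationship becomes its own table; task_assignment
-- and start_date describe one particular (employee, project) pairing.
create table works_on
  (emp_id          integer references employee,
   proj_no         integer references project,
   task_assignment char(15),
   start_date      date,
   primary key (emp_id, proj_no));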

2.1.5 Existence of an Entity in a Relationship

Existence of an entity occurrence in a relationship is defined as either mandatory or optional. If an occurrence of either the “one” or “many” side entity must always exist for the entity to be included in the relationship, then it is mandatory. When an occurrence of that entity need not always exist, it is considered optional. For example, in Figure 2.2 the entity Employee may or may not be the manager of any Department, thus making the entity Department in the “is-managed-by” relationship between Employee and Department optional.


Optional existence, defined by a zero on the connection line between an entity and a relationship, defines a minimum connectivity of zero. Mandatory existence defines a minimum connectivity of one. When existence is unknown, we assume the minimum connectivity is one (that is, mandatory).

Maximum connectivities are defined explicitly on the ER diagram as a constant (if a number is shown on the ER diagram next to an entity) or a variable (by default if no number is shown on the ER diagram next to an entity). For example, in Figure 2.2, the relationship “is-occupied-by” between the entity Office and Employee implies that an Office may house from zero to some variable maximum (N) number of Employees, but an Employee must be housed in exactly one Office, that is, mandatory.
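
In a relational rendering (a hedged sketch with assumed names), mandatory existence typically becomes a NOT NULL foreign key, while the optional side needs no constraint at all:

create table office
  (office_no integer primary key,
   location  char(15));

-- Mandatory: an employee must be housed in exactly one office.
create table employee
  (emp_id    integer primary key,
   emp_name  char(15),
   office_no integer not null references office);

-- Optional: an office with no matching employee rows is simply
-- unoccupied; nothing in the schema forces occupancy.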

Existence is often implicit in the real world. For example, an entity Employee associated with a dependent (weak) entity, Dependent, cannot be optional, but the weak entity is usually optional. Using the concept of optional existence, an entity instance may be able to exist in other relationships even though it is not participating in this particular relationship.

The term existence is also associated with identification of a data object. Many DBMSs provide unique identifiers for rows (Oracle ROWIDs, for example). Identifying an object such as a row can be done in an existence-based way. It can also be done in a value-based way by identifying the object (row) with the values of one or more attributes or columns of the table.

2.1.6 Alternative Conceptual Data Modeling Notations

At this point we need to digress briefly to look at other conceptual data modeling notations that are commonly used today and compare them with the Chen approach. A popular alternative form for one-to-many and many-to-many relationships uses “crow’s-foot” notation for the “many” side (see Figure 2.3a). This form is used by some CASE tools, such as Knowledgeware’s Information Engineering Workbench (IEW). Relationships have no explicit construct but are implied by the connection line between entities and a relationship name on the connection line. Minimum connectivity is specified by either a 0 (for zero) or a perpendicular line (for one) on the connection lines between entities. The term intersection entity is used to designate a weak entity, especially an entity that is equivalent to a many-to-many relationship. Another popular form used today is the IDEF1X notation [IDEF1X, 2005], conceived by Robert G. Brown [Bruce, 1992]. The similarities with the Chen notation are obvious in Figure 2.3b. Fortunately, any of these forms is reasonably easy to learn and read, and their equivalence with the basic ER concepts is obvious from the diagrams. Without a clear standard for the ER model, however, many other constructs are being used today in addition to the three types shown here.

Figure 2.3 Conceptual data modeling notations. (a) [Figure: ER model constructs in the Chen notation side by side with the “crow’s-foot” approach [Knowledgeware]: one-to-one (Department is-managed-by Employee), one-to-many (Division has Department; Office is-occupied-by Employee, with minimum/maximum markings such as min = 0, max = 1 and min = 1, max = 1), many-to-many (Employee works-on Project), recursive binary relationship (Employee is-group-leader-of Employee, shown as a recursive entity), and weak entity (Employee-job-history, shown as an intersection entity).]

(b) [Figure: ER model constructs in the Chen notation side by side with IDEF1X [Bruce, 1992]: an entity with attributes (EMPLOYEE with primary key emp-id and nonprimary key attributes emp-name and job-class), one-to-one, one-to-many with the many side optional, one-to-many with the one side optional, many-to-many, and a recursive binary relationship, using the same Department, Division, Office, Employee, and Project examples as part (a).]


2.2 Advanced ER Constructs

2.2.1 Generalization: Supertypes and Subtypes

The original ER model has been effectively used for communicating fundamental data and relationship definitions with the end user for a long time. However, using it to develop and integrate conceptual models with different end user views was severely limited until it could be extended to include database abstraction concepts such as generalization. The generalization relationship specifies that several types of entities with certain common attributes can be generalized into a higher-level entity type—a generic or superclass entity, more commonly known as a supertype entity. The lower levels of entities—subtypes in a generalization hierarchy—can be either disjoint or overlapping subsets of the supertype entity. As an example, in Figure 2.4 the entity Employee is a higher-level abstraction of Manager, Engineer, Technician, and Secretary—all of which are disjoint types of Employee. The ER model construct for the generalization abstraction is the connection of a supertype entity with its subtypes, using a circle and the subset symbol on the connecting lines from the circle to the subtype entities. The circle contains a letter specifying a disjointness constraint (see the following discussion). Specialization, the reverse of generalization, is an inversion of the same concept; it indicates that subtypes specialize the supertype.

Figure 2.4 Supertypes and subtypes [Figure: (a) generalization with disjoint subtypes: supertype Employee connected through a circle labeled “d” to subtypes Manager, Engineer, Technician, and Secretary; (b) generalization with overlapping subtypes and completeness constraint: supertype Individual connected through a circle labeled “o” to subtypes Employee and Customer.]

A supertype entity in one relationship may be a subtype entity in another relationship. When a structure comprises a combination of supertype/subtype relationships, that structure is called a supertype/subtype hierarchy or generalization hierarchy. Generalization can also be described in terms of inheritance, which specifies that all the attributes of a supertype are propagated down the hierarchy to entities of a lower type. Generalization may occur when a generic entity, which we call the supertype entity, is partitioned by different values of a common attribute. For example, in Figure 2.4 the entity Employee is a generalization of Manager, Engineer, Technician, and Secretary over the attribute “job-title” in Employee.

Generalization can be further classified by two important constraints on the subtype entities: disjointness and completeness. The disjointness constraint requires the subtype entities to be mutually exclusive. We denote this type of constraint by the letter “d” written inside the generalization circle (Figure 2.4a). Subtypes that are not disjoint (i.e., that overlap) are designated by the letter “o” inside the circle. As an example, the supertype entity Individual has two subtype entities, Employee and Customer; these subtypes could be described as overlapping, or not mutually exclusive (Figure 2.4b). Regardless of whether the subtypes are disjoint or overlapping, they may have special attributes in addition to the generic (inherited) attributes from the supertype.

The completeness constraint requires the subtypes to be all-inclusive of the supertype. Thus, subtypes can be defined as either total or partial coverage of the supertype. For example, in a generalization hierarchy with supertype Individual and subtypes Employee and Customer, the subtypes may be described as all-inclusive or total. We denote this type of constraint by a double line between the supertype entity and the circle. This is indicated in Figure 2.4b, which implies that the only types of individuals to be considered in the database are employees and customers.


2.2.2 Aggregation

Aggregation is a form of abstraction between a supertype and subtype entity that is significantly different from the generalization abstraction. Generalization is often described in terms of an “is-a” relationship between the subtype and the supertype—for example, an Employee is an Individual. Aggregation, on the other hand, is the relationship between the whole and its parts, and is described as a “part-of” relationship—for example, a report and a prototype software package are both parts of a deliverable for a contract. Thus, in Figure 2.5, the entity Software-product is seen to consist of the component parts Program and User’s Guide. The construct for aggregation is similar to generalization, in that the supertype entity is connected with the subtype entities with a circle; in this case, the letter “A” is shown in the circle. However, there are no subset symbols, because the “part-of” relationship is not a subset. Furthermore, there are no inherited attributes in aggregation; each entity has its own unique set of attributes.

Figure 2.5 Aggregation: the entity Software-product is composed of the parts Program and User’s Guide.

2.2.3 Ternary Relationships

Ternary relationships are required when binary relationships are not sufficient to accurately describe the semantics of an association among three entities. Ternary relationships are somewhat more complex than binary relationships, however. The ER notation for a ternary relationship is shown in Figure 2.6, with three entities attached to a single relationship diamond, and the connectivity of each entity is designated as either “one” or “many.” An entity in a ternary relationship is considered to be “one” if only one instance of it can be associated with one instance of each of the other two associated entities. It is “many” if more than one instance of it can be associated with one instance of each of the other two associated entities. In either case, it is assumed that one instance of each of the other entities is given.

Figure 2.6 Ternary relationships. (a) One-to-one-to-one (uses-notebook): a technician uses exactly one notebook for each project, and each notebook belongs to one technician for each project; a technician may still work on many projects and maintain different notebooks for different projects. Functional dependencies: emp-id, project-name → notebook-no; emp-id, notebook-no → project-name; project-name, notebook-no → emp-id. (b) One-to-one-to-many (assigned-to): each employee assigned to a project works at only one location for that project, but can be at different locations for different projects; at a particular location, an employee works on only one project, and there can be many employees assigned to a given project at that location. Functional dependencies: emp-id, loc-name → project-name; emp-id, project-name → loc-name.

As an example, the relationship “manages” in Figure 2.6c associates the entities Manager, Engineer, and Project. The entities Engineer and Project are considered “many”; the entity Manager is considered “one.” This is represented by the following assertions.

• Assertion 1: One engineer, working under one manager, could be working on many projects.

• Assertion 2: One project, under the direction of one manager, could have many engineers.

• Assertion 3: One engineer, working on one project, must have only a single manager.

Figure 2.6 (continued) (c) One-to-many-to-many (manages): each engineer working on a particular project has exactly one manager, but each manager of a project may manage many engineers, and each manager of an engineer may manage that engineer on many projects. Functional dependency: project-name, emp-id → mgr-id. (d) Many-to-many-to-many (skill-used): employees can use many skills on any one of many projects, and each project has many employees with various skills. Functional dependencies: none.

Assertion 3 could also be written in another form, using an arrow (→) in a kind of shorthand called a functional dependency. For example:

emp-id, project-name → mgr-id

where emp-id is the key (unique identifier) associated with the entity Engineer, project-name is the key associated with the entity Project, and mgr-id is the key of the entity Manager. In general, for an n-ary relationship, each entity considered to be a “one” has its key appearing on the right side of exactly one functional dependency (FD). No entity considered “many” ever has its key appear on the right side of an FD.

All four forms of ternary relationships are illustrated in Figure 2.6. In each case, the number of “one” entities implies the number of FDs used to define the relationship semantics, and the key of each “one” entity appears on the right side of exactly one FD for that relationship.

Ternary relationships can have attributes in the same way that many-to-many binary relationships can. The values of these attributes are uniquely determined by some combination of the keys of the entities associated with the relationship. For example, in Figure 2.6d the relationship “skill-used” might have the attribute “tool” associated with a given employee using a particular skill on a certain project, indicating that a value for tool is uniquely determined by the combination of employee, skill, and project.
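To make the FD rules concrete, here is a minimal SQL sketch of the many-to-many-to-many case in Figure 2.6d. The table and column names (skill_used, emp_id, skill_type, project_name, tool) are illustrative assumptions; the book’s own ER-to-SQL transformation rules appear in later chapters.

    -- The ternary relationship skill-used as a table. Because no FD holds
    -- among the three entity keys (case d), all three together form the
    -- primary key, and the relationship attribute depends on the full key.
    CREATE TABLE skill_used (
        emp_id       INTEGER     NOT NULL,
        skill_type   VARCHAR(20) NOT NULL,
        project_name VARCHAR(20) NOT NULL,
        tool         VARCHAR(20),
        PRIMARY KEY (emp_id, skill_type, project_name)
    );

By contrast, in the one-to-many-to-many case of Figure 2.6c, the FD project-name, emp-id → mgr-id would leave (project_name, emp_id) alone as the key, with mgr_id as an ordinary dependent column.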

2.2.4 General n-ary Relationships

Generalizing the ternary form to higher-degree relationships, an n-ary relationship that describes some association among n entities is represented by a single relationship diamond with n connections, one to each entity (see Figure 2.7). The meaning of this form can best be described in terms of the functional dependencies among the keys of the n associated entities. There can be anywhere from zero to n FDs, depending on the number of “one” entities. Each FD that describes an n-ary relationship has n components: n – 1 keys on the left side (the determinant) and one key on the right side. A ternary relationship (n = 3), for example, has two components on the left and one on the right, as we saw in the example in Figure 2.6. In a more complex database, other types of FDs may also exist within an n-ary relationship. When this occurs, the ER model does not provide enough semantics on its own, and it must be supplemented with a narrative description of these dependencies.

Figure 2.7 n-ary relationships: the single relationship enrolls-in associates Student, Class, Room, Day, and Time.

2.2.5 Exclusion Constraint

The normal, or default, treatment of multiple relationships is the inclusive OR, which allows any or all of the entities to participate. In some situations, however, multiple relationships may be affected by the exclusive OR (exclusion) constraint, which allows at most one entity instance among several entity types to participate in the relationship with a single root entity. For example, in Figure 2.8, suppose the root entity Work-task has two associated entities, External-project and Internal-project. At most one of the associated entity instances could apply to an instance of Work-task.

Figure 2.8 Exclusion constraint: a work task can be assigned to either an external project or an internal project, but not both (relationships is-assigned-to and is-for, joined by the exclusion symbol +).
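One plausible SQL realization of this constraint (a sketch with assumed names, not the book’s prescribed transformation) gives Work-task two nullable foreign keys and a CHECK that at most one of them is set:

    CREATE TABLE external_project (ext_proj_id INTEGER PRIMARY KEY);
    CREATE TABLE internal_project (int_proj_id INTEGER PRIMARY KEY);

    CREATE TABLE work_task (
        task_id     INTEGER PRIMARY KEY,
        ext_proj_id INTEGER REFERENCES external_project (ext_proj_id),
        int_proj_id INTEGER REFERENCES internal_project (int_proj_id),
        -- exclusive OR: at most one of the two projects may apply
        CHECK (ext_proj_id IS NULL OR int_proj_id IS NULL)
    );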


2.2.6 Referential Integrity

We note that a foreign key is an attribute of a table (not necessarily a key of any kind) that relates to a key in another table. Referential integrity requires that for every foreign key instance that exists in a table, the row (and thus the key instance) of the parent table associated with that foreign key instance must also exist. The referential integrity constraint has become integral to relational database design and is usually implied as a requirement for the resulting relational database implementation. (Chapter 5 discusses the SQL implementation of referential integrity constraints.)
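As a brief preview of that SQL implementation (a sketch; the table and column names are assumed for illustration):

    -- Referential integrity declared in SQL: every non-null dept_no
    -- stored in emp must match an existing department row.
    CREATE TABLE department (
        dept_no   INTEGER PRIMARY KEY,
        dept_name VARCHAR(30)
    );

    CREATE TABLE emp (
        emp_id  INTEGER PRIMARY KEY,
        dept_no INTEGER,
        FOREIGN KEY (dept_no) REFERENCES department (dept_no)
    );

An attempt to insert an emp row whose dept_no has no matching department row, or to delete a department row that emp rows still reference, violates the constraint and is rejected.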

2.3 Summary

The basic concepts of the ER model and their constructs are described in this chapter. An entity is a person, place, thing, or event of informational interest. Attributes are objects that provide descriptive information about entities. Attributes may be unique identifiers or nonunique descriptors. Relationships describe the connectivity between entity instances: one-to-one, one-to-many, or many-to-many. The degree of a relationship is the number of associated entities: two (binary), three (ternary), or any n (n-ary). The role (name), or relationship name, defines the function of an entity in a relationship.

The concept of existence in a relationship determines whether an entity instance must exist (mandatory) or not (optional). So, for example, the minimum connectivity of a binary relationship—that is, the number of entity instances on one side that are associated with one instance on the other side—can be either zero, if optional, or one, if mandatory. The concept of generalization allows for the implementation of supertype and subtype abstractions.

The more advanced constructs in ER diagrams are sporadically used and have no generally accepted form as yet. They include ternary relationships, which we define in terms of the FD concept of relational databases; constraints on exclusion; and the implicit constraints from the relational model, such as referential integrity.


2.4 Literature Summary

Most of the notation in this chapter is from Chen’s original ER definition [Chen, 1976]. The concept of data abstraction was first proposed by Smith and Smith [1977] and applied to the ER model by Scheuermann, Scheffner, and Weber [1980], Elmasri and Navathe [2003], Bruce [1992], and IDEF1X [2005], among others. The application of the semantic network model to conceptual schema design was shown by Bachman [1977], McLeod and King [1979], Hull and King [1987], and Peckham and Maryanski [1988].


3 The Unified Modeling Language (UML)

The Unified Modeling Language (UML) is a graphical language for communicating design specifications for software. The object-oriented software development community created UML to meet the special needs of describing object-oriented software design. UML has grown into a standard for the design of digital systems in general.

There are a number of different types of UML diagrams serving various purposes [Rumb05]. The class and activity diagram types are particularly useful for discussing database design issues. UML class diagrams capture the structural aspects found in database schemas. UML activity diagrams facilitate discussion on the dynamic processes involved in database design. This chapter is an overview of the syntax and semantics of the UML class and activity diagram constructs used in this book. These same concepts are useful for planning, documenting, discussing, and implementing databases. We are using UML 2.0, although for the purposes of the class diagrams and activity diagrams shown in this book, if you are familiar with UML 1.4 or 1.5 you will probably not see any differences.

UML class diagrams and entity-relationship (ER) models [Chen, 1976; Chen, 1987] are similar in both form and semantics. The original creators of UML point out the influence of ER models on the origins of class diagrams [Rumbaugh, Jacobson, and Booch, 2005]. The influence of UML has in turn affected the database community. Class diagrams now appear frequently in the database literature to describe database schemas.

UML activity diagrams are similar in purpose to flowcharts. Processes are partitioned into constituent activities along with control flow specifications.

This chapter is organized into three sections. Section 3.1 presents class diagram notation, along with examples. Section 3.2 covers activity diagram notation, along with illustrative examples. Section 3.3 concludes with a few rules of thumb for UML usage.

3.1 Class Diagrams

A class is a descriptor for a set of objects that share some attributes and/or operations. We conceptualize classes of objects in our everyday lives. For example, a car has attributes, such as vehicle identification number (VIN) and mileage. A car also has operations, such as accelerate and brake. All cars have these attributes and operations. Individual cars differ in the details. A given car has its own values for VIN and mileage. For example, a given car might have VIN 1NXBR32ES3Z126369 and a mileage of 22,137 miles. Individual cars are objects that are instances of the class “Car.”

Classes and objects are a natural way of conceptualizing the world around us. The concepts of classes and objects are also the paradigms that form the foundation of object-oriented programming. The development of object-oriented programming led to the need for a language to describe object-oriented design, giving rise to UML.

There is a close correspondence between class diagrams in UML and ER diagrams. Classes are analogous to entities. Database schemas can be diagrammed using UML. It is possible to conceptualize a database table as a class. The columns in the table are the attributes, and the rows are objects of that class. For example, we could have a table named “Car” with columns named “vin” and “mileage.” Each row in the table would have values for these columns, representing an individual car. A given car might be represented by a row with the value “1NXBR32ES3Z126369” in the vin column, and 22,137 in the mileage column.
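That correspondence can be made literal with a small SQL sketch (the column types are assumptions for illustration):

    -- The class "Car" as a table: columns are the attributes,
    -- and each row is one object (one individual car).
    CREATE TABLE Car (
        vin     VARCHAR(17) PRIMARY KEY,
        mileage INTEGER
    );

    INSERT INTO Car (vin, mileage) VALUES ('1NXBR32ES3Z126369', 22137);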

The major difference between classes and entities is the lack of operations in entities. Note that the term operation is used here in the UML sense of the word. Stored procedures, functions, triggers, and constraints are forms of named behavior that can be defined in databases; however, these are not associated with the behavior of individual rows. The term operations in UML refers to the methods inherent in classes of objects. These behaviors are not stored in the definition of rows within the database. There are no operations named “accelerate” or “brake” associated with rows in our “Car” table. Classes can be shown with attributes and no operations in UML, which is the typical usage for database schemas.

3.1.1 Basic Class Diagram Notation

The top of Figure 3.1 illustrates the UML syntax for a class, showing both attributes and operations. It is also possible to include user-defined named compartments, such as responsibilities. We will focus on the class name, attributes, and operations compartments. The UML icon for a class is a rectangle. When the class is shown with attributes and operations, the rectangle is subdivided into three horizontal compartments. The top compartment contains the class name, centered in boldface, beginning with a capital letter. Typically, class names are nouns. The middle compartment contains attribute names, left-justified in regular face, beginning with a lowercase letter. The bottom compartment contains operation names, left-justified in regular face, beginning with a lowercase letter and ending with a parenthetical expression. The parentheses may contain arguments for the operation.

The class notation has some variations reflecting emphasis. Classes can be written without the attribute compartment and/or the operations compartment. Operations are important in software. If the software designer wishes to focus on the operations, the class can be shown with only the class name and operations compartments. Showing operations and hiding attributes is a very common syntax used by software designers. Database designers, on the other hand, do not generally deal with class operations; however, the attributes are of paramount importance. The needs of the database designer can be met by writing the class with only the class name and attribute compartments showing. Hiding operations and showing attributes is an uncommon syntax for a software designer, but it is common for database design. Lastly, in high-level diagrams, it is often desirable to illustrate the relationships of the classes without becoming entangled in the details of the attributes and operations. Classes can be written with just the class name compartment when simplicity is desired.

Various types of relationships may exist between classes. Associations are one type of relationship. The most generic form of association is drawn with a line connecting two classes. For example, in Figure 3.1 there is an association between the class “Car” and the class named “Driver.”

A few types of associations, such as aggregation and composition, are very common. UML has designated symbols for these associations. Aggregation indicates “part of” associations, where the parts have an independent existence. For example, a Car may be part of a Car Pool. The Car also exists on its own, independent of any Car Pool. Another distinguishing feature of aggregation is that the part may be shared among multiple objects. For example, a Car may belong to more than one Car Pool. The aggregation association is indicated with a hollow diamond attached to the class that holds the parts. Figure 3.1 indicates that a Car Pool aggregates Cars.

Figure 3.1 Basic UML class diagram constructs: the notation and an example of a class with name, attribute, and operation compartments (Car, with attributes vin and mileage and operations accelerate() and brake()); notational variations emphasizing operations, attributes, or the class name alone; and the relationship types association (Car–Driver), generalization (Sedan to Car), aggregation (Car Pool aggregates Car), and composition (Frame is part of Car).

Composition is another “part of” association, in which the parts are strictly owned, not shared. For example, a Frame is part of a single Car. The notation for composition is an association adorned with a solid black diamond attached to the class that owns the parts. Figure 3.1 indicates that a Frame is part of the composition of a Car.

Generalization is another common relationship. For example, Sedan is a type of car. The “Car” class is more general than the “Sedan” class. Generalization is indicated by a solid line adorned with a hollow arrowhead pointing to the more general class. Figure 3.1 shows generalization from the Sedan class to the Car class.

3.1.2 Class Diagrams for Database Design

The reader may be interested in the similarities and differences between UML class diagrams and ER models. Figures 3.2 through 3.5 parallel some of the figures in Chapter 2, allowing for easy comparison. We then turn our attention to capturing primary key information in Figure 3.6. We conclude this section with an example database schema of the music industry, illustrated by Figures 3.7 through 3.10.

Figure 3.2 illustrates UML constructs for relationships with various degrees of association and multiplicities. These examples are parallel to the ER models shown in Figure 2.2. You may refer back to Figure 2.2 if you wish to contrast UML constructs with ER constructs.

Associations between classes may be reflexive, binary, or n-ary. Reflexive association is a term we are carrying over from ER modeling. It is not a term defined in UML, although it is worth discussing. A reflexive association relates a class to itself. The reflexive association in Figure 3.2 means an Employee in the role of manager is associated with many managed Employees. The roles of classes in a relationship may be indicated at the ends of the relationship. The number of objects involved in the relationship, referred to as multiplicity, may also be specified at the ends of the relationship. An asterisk indicates that many objects take part in the association at that end of the relationship. The multiplicities of the reflexive association example in Figure 3.2 indicate that an Employee is associated with one manager, and a manager is associated with many managed Employees.
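In a relational schema, a reflexive association typically becomes a foreign key that refers back into the same table. A minimal sketch with assumed names and types; mgr_id is made nullable here so the management chain can terminate, a practical relaxation of the diagram’s multiplicity of one:

    -- Reflexive association: mgr_id points at another row of employee.
    CREATE TABLE employee (
        emp_id   INTEGER PRIMARY KEY,
        emp_name VARCHAR(40),
        mgr_id   INTEGER REFERENCES employee (emp_id)
    );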

Teorey.book Page 37 Saturday, July 16, 2005 12:57 PM

Page 52: Database modeling and design logical design 4th edition data management systems

38 CHAPTER 3 The Unified Modeling Language (UML)

Figure 3.2 Selected UML relationship types (parallel to Figure 2.2). Degree: a reflexive association (Employee, with roles manager and managed), a binary association (Division–Department), and a ternary association (Employee–Project–Skill, with roles assignment and skill used). Multiplicities: one-to-one (Department–Employee, role manager), one-to-many (Department–Employee), and many-to-many (Employee–Project, with the association class WorkAssignment carrying task-assignment and start-date). Existence: optional (an Employee in the role of manager associated with 0..1 Department) and mandatory (each occupant Employee associated with exactly one Office).

A binary association is a relationship between two classes. For example, one Division has many Departments. Notice the solid black diamond at the Division end of the relationship. The solid diamond is an adornment to the association that indicates composition. The Division is composed of Departments.

The ternary relationship in Figure 3.2 is an example of an n-ary association—an association that relates three or more classes. All classes partaking in the association are connected to a hollow diamond. Roles and/or multiplicities are optionally indicated at the ends of the n-ary association. Each end of the ternary association example in Figure 3.2 is marked with an asterisk, signifying many. The meaning of each multiplicity is isolated from the other multiplicities. Given a class, if you have exactly one object from every other class in the association, the multiplicity is the number of associated objects for the given class. One Employee working on one Project assignment uses many Skills. One Employee uses one Skill on many Project assignments. One Skill used on one Project is fulfilled by many Employees.

The next three class diagrams in Figure 3.2 show various combinations of multiplicities. The illustrated one-to-one association specifies that each Department is associated with exactly one Employee acting in the role of manager, and each manager is associated with exactly one Department. The diagram with the one-to-many association means that each Department has many Employees, and each Employee belongs to exactly one Department.

The many-to-many example in Figure 3.2 means each Employee associates with many Projects, and each Project associates with many Employees. This example also illustrates the use of an association class. If an association has attributes, these are written in a class that is attached to the association with a dashed line. The association class named “WorkAssignment” in Figure 3.2 contains two association attributes named “task-assignment” and “start-date.” The association and the class together form an association class.

Multiplicity can be a range of integers, written with the minimum and maximum values separated by two periods. The asterisk by itself carries the same meaning as the range [0 .. *]. Also, if the minimum and maximum are the same number, then the multiplicity can be written as a single number. For example, [1 .. 1] means the same as [1]. Optional existence can be specified using a zero. The [0 .. 1] in the optional existence example of Figure 3.2 means an Employee in the role of manager is associated with either no Department (e.g., is upper management) or one Department.

Mandatory existence is specified whenever a multiplicity begins with a positive integer. The example of mandatory existence in Figure 3.2 means an Employee is an occupant of exactly one Office. One end of an association can indicate mandatory existence, while the other end may use optional existence. This is the case in the example, where an Office may have any number of occupants, including zero.

Generalization is another type of relationship; a superclass is a generalization of a subclass. Specialization is the opposite of generalization; a subclass is a specialization of the superclass. The generalization relationship in UML is written with a hollow arrow pointing from the subclass to the generalized superclass. The top example in Figure 3.3 shows four subclasses: Manager, Engineer, Technician, and Secretary. These four subclasses are all specializations of the more general superclass Employee; that is, Managers, Engineers, Technicians, and Secretaries are types of Employees.

Notice the four relationships share a common arrowhead. Semantically, these are still four separate relationships. The sharing of the arrowhead is permissible in UML, to improve the clarity of the diagrams.

Figure 3.3 UML generalization constructs (parallel to Figure 2.4): at the top, the superclass Employee with subclasses Manager, Engineer, Technician, and Secretary; at the bottom, the superclass Individual with subclasses Employee and Customer, which are in turn superclasses of EmpCust, annotated with the note “Complete enumeration of subclasses.”

The bottom example in Figure 3.3 illustrates that a class can act as both a subclass in one relationship and a superclass in another relationship. The class named Individual is a generalization of the Employee and Customer classes. The Employee and Customer classes are in turn superclasses of the EmpCust class. A class can be a subclass in more than one generalization relationship. The meaning in the example is that an EmpCust object is both an Employee and a Customer.

You may occasionally find that UML doesn’t supply a standard symbol for what you are attempting to communicate. UML incorporates some extensibility to accommodate user needs, such as a note. A note in UML is written as a rectangle with a dog-eared upper right corner. The note can attach to the pertinent element(s) with a dashed line(s). Write briefly in the note what you wish to convey. The bottom diagram in Figure 3.3 illustrates a note, which describes the Employee and Customer classes as the “Complete enumeration of subclasses.”

The distinction between composition and aggregation is sometimes elusive for those new to UML. Figure 3.4 shows an example of each, to help clarify. The top diagram means that a Program and Electronic Documentation both contribute to the composition of a Software Product. The composition signifies that the parts do not exist without the Software Product (there is no software pirating in our ideal world). The bottom diagram specifies that a Teacher and a Textbook are aggregated by a Course. The aggregation signifies that the Teacher and the Textbook are part of the Course, but they also exist separately. If a Course is canceled, the Teacher and the Textbook continue to exist.

Figure 3.4 UML aggregation constructs (parallel to Figure 2.5): a Software Product composed of Program and Electronic Documentation; a Course aggregating Teacher and Textbook.

Figure 3.5 illustrates another example of an n-ary relationship. The n-ary relationship may be clarified by specifying roles next to the participating classes. A Student is an enrollee in a class, associated with a given Room location, scheduled Day, and meeting Time.

Figure 3.5 UML n-ary relationship (parallel to Figure 2.7): a Student (role: enrollee) associated with a Course (role: class), a Room (role: location), a Day (role: scheduled day), and a Time (role: meeting time).

The concept of a primary key arises in the context of database design. Often, each row of a table is uniquely identified by the values contained in one or more columns designated as the primary key. Objects in software are not typically identified in this fashion. As a result, UML does not have an icon representing a primary key. However, UML is extensible. The meaning of an element in UML may be extended with a stereotype. Stereotypes are depicted with a short natural language word or phrase, enclosed in guillemets: « and ». We take advantage of this extensibility, using a stereotype «pk» to designate primary key attributes. Figure 3.6 illustrates the stereotype mechanism. The vin attribute is specified as the primary key for Cars. This means that a given vin identifies a specific Car. A noteworthy rule of thumb for primary keys: when a composition relationship exists, the primary key of the part includes the primary key of the owning object. The second diagram in Figure 3.6 illustrates this point.

Figure 3.6 UML constructs illustrating primary keys: a Car class with attributes «pk» vin, mileage, and color (primary key as a stereotype), and a composition example with primary keys, in which an Invoice («pk» inv_num, customer_id, inv_date) owns 1..* LineItems («pk» inv_num, «pk» line_num, description, amount).
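Translated into SQL (a sketch; the column types are assumptions), the rule of thumb means the part’s primary key is a composite that embeds the owner’s key:

    CREATE TABLE invoice (
        inv_num     INTEGER PRIMARY KEY,
        customer_id INTEGER,
        inv_date    DATE
    );

    -- The owned part: its primary key includes the owner's key,
    -- and referential integrity backs the composition.
    CREATE TABLE line_item (
        inv_num     INTEGER REFERENCES invoice (inv_num),
        line_num    INTEGER,
        description VARCHAR(40),
        amount      DECIMAL(10, 2),
        PRIMARY KEY (inv_num, line_num)
    );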

3.1.3 Example from the Music Industry

Large database schemas may be introduced with high-level diagrams. Details can be broken out in additional diagrams. The overall goal is to present ideas in a clear, organized fashion. UML offers notational variations and organizational mechanisms. You will sometimes find that there are multiple ways of representing the same material in UML. The decisions you make with regard to your representation depend in part on your purpose for a given diagram. Figures 3.7 through 3.10 illustrate some of the possibilities, with an example drawn from the music industry.

Packages may be used to organize classes into groups. Packages may themselves also be grouped into packages. The goal of using packages is to make the overall design of a system more comprehensible. One use for packages is to represent a schema. You can then show multiple schemas concisely. Another use for packages is to group related classes together within a schema, and present the schema clearly. Given a set of classes, different people may conceptualize different groupings. The division is a design decision, with no right or wrong answer. Whatever decisions are made, the result should enhance readability. The notation for a package is a folder icon, and the contents of a package can be optionally shown in the body of the folder. If the contents are shown, then the name of the package is placed in the tab. If the contents are elided, then the name of the package is placed in the body of the icon.

Figure 3.7 Example of related packages: Music, Media, and Distribution.

If the purpose is to illustrate the relationships of the packages, and the classes are not important at the moment, then it is better to illustrate with the contents elided. Figure 3.7 illustrates the notation with the music industry example at a very high level. Music is created and placed on Media. The Media is then Distributed. There is an association between Music and Media, and between Media and Distribution.

Let us look at the organization of the classes. The music industry is illustrated in Figure 3.8 with the classes listed. The Music package contains classes that are responsible for creating the music. Examples of Groups are the Beatles and the Bangles. Sarah McLachlan and Sting are Artists. Groups and Artists are involved in creating the music. We will look shortly at the other classes and how they are related. The Media package contains classes that physically hold the recordings of the music. The Distribution package contains classes that bring the media to you.

Figure 3.8 Example illustrating classes grouped into packages: the Music package (Group, Artist, Composer, Lyricist, Musician, Instrument, Song, Rendition), the Media package (Music Media, Album, CD, Track), and the Distribution package (Studio, Publisher, Retail Store).

Figure 3.9 Relationships between classes in the Music package: Group aggregates 2..* Artists; Composer, Lyricist, and Musician are subclasses of Artist; Song associates with 1..* Composers and 0..* Lyricists; Rendition associates with exactly one Song and with Musicians and Instruments.

The contents of a package can be expanded into greater detail. The relationships of the classes within the Music package are illustrated in Figure 3.9. A Group is an aggregation of two or more Artists. As indicated by the multiplicity between Artist and Group, [0 .. *], an Artist may or may not be in a Group, and may be in more than one Group. Composers, Lyricists, and Musicians are different types of Artists. A Song is associated with one or more Composers. A Song may have no Lyricists, or any number of Lyricists. A Song may have any number of Renditions. A Rendition is associated with exactly one Song. A Rendition is associated with Musicians and Instruments. A given Musician-Instrument combination is associated with any number of Renditions. A specific Rendition-Musician combination may be associated with any number of Instruments. A given Rendition-Instrument combination is associated with any number of Musicians.
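A hedged sketch of how the Group–Artist aggregation might be captured relationally (table and column names are assumed; music_group avoids the reserved word GROUP):

    CREATE TABLE artist (
        artist_id   INTEGER PRIMARY KEY,
        artist_name VARCHAR(40)
    );

    CREATE TABLE music_group (
        group_id   INTEGER PRIMARY KEY,
        group_name VARCHAR(40)
    );

    -- Shared aggregation becomes a many-to-many membership table.
    -- The 2..* minimum (a Group has at least two Artists) is not
    -- declaratively expressible here; it would need a trigger or an
    -- application check.
    CREATE TABLE group_membership (
        group_id  INTEGER REFERENCES music_group (group_id),
        artist_id INTEGER REFERENCES artist (artist_id),
        PRIMARY KEY (group_id, artist_id)
    );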

A system can be understood more easily by shifting focus to each package in turn. We turn our attention now to the classes and relationships in the Media package, shown in Figure 3.10. The associated classes from the Music and Distribution packages are also shown, detailing how the Media package is related to the other two packages. The Music Media is associated with the Group and Artist classes, which are contained in the Music package shown in Figure 3.8. The Music Media is also associated with the Publisher, Studio, and Producer classes, which are contained in the Distribution package shown in Figure 3.8. Albums and CDs are types of Music Media. Albums and CDs are both composed of Tracks. Tracks are associated with Renditions.

Figure 3.10 Classes of the Media package and related classes: Music Media, with subtypes Album and CD composed of Tracks, associated with Group, Artist, and Rendition from the Music package and with Publisher, Studio, and Producer from the Distribution package.

3.2 Activity Diagrams

UML has a full suite of diagram types, each of which fulfills a need for describing a view of the design. UML activity diagrams are used to specify the activities and the flow of control in a process. The process may be a workflow followed by people, organizations, or other physical things. Alternatively, the process may be an algorithm implemented in software. The syntax and the semantics of the UML constructs are the same, regardless of the process described. Our examples draw from workflows that are followed by people and organizations, since these are more useful for the logical design of databases.

3.2.1 Activity Diagram Notation Description

Activity diagrams include notation for nodes, control flow, and organization. The icons we are describing here are outlined in Figure 3.11. The notation is further clarified by example in Section 3.2.2.

The nodes include the initial node, final node, and activity node. Any process begins with control residing in the initial node, represented as a solid black circle. The process terminates when control reaches a final node, represented as a solid black circle surrounded by a concentric circle (i.e., a bull’s-eye). Activity nodes are states where specified work is processed. For example, an activity might be named “Generate quote.” The name of an activity is typically a descriptive verb or short verb phrase, written inside a lozenge shape. Control resides in an activity until that activity is completed. Then control follows the outgoing flow.

Control flow icons include flows, decisions, forks, and joins. A flow is drawn with an arrow. Control flows in the direction of the arrow. Decision nodes are drawn as a hollow diamond with multiple outgoing flows. Each flow from a decision node must have a guard condition. A guard condition is written in brackets next to the flow. Control flows in exactly one direction from a decision node, and only follows a flow if the guard condition is true. The guard conditions associated with a decision node must be mutually exclusive, to avoid nondeterministic behavior. There can be no ambiguity as to which direction the control flows. The guards must cover all possible test conditions, so that control is not blocked at the decision node. One path may be guarded with [else]. If a path is guarded by [else], then control flows in that direction only if all the other guards fail. Forks and joins are both forms of synchronization written with a solid bar. The fork has one incoming flow and multiple outgoing flows. When control flows to a fork, the control concurrently follows all the outgoing flows. These are referred to as concurrent threads. Joins are the opposite of forks; the join construct has multiple incoming flows and one outgoing flow. Control flows from a join only when control has reached the join from each of the incoming flows.

Figure 3.11 UML activity diagram constructs. Nodes: initial node, final node, and activity node. Control: flow, decision (branch) with [guard] and [alternative guard], fork, and join. Organization: partition (swim lanes) into named subsets.

Activity diagrams may be further organized using partitions, also known as swim lanes. Partitions split activities into subsets, organized by responsible party. Each subset is named and enclosed with lines.

3.2.2 Activity Diagrams for Workflow

Figure 3.12 illustrates the UML activity diagram constructs used for the publication of this book. This diagram is partitioned into two subsets of activities, organized by responsible party: the left subset contains Customer activities, and the right subset contains Manufacturer activities. Activity partitions are sometimes called swim lanes, because of their typical appearance. Activity partitions may be arranged vertically, horizontally, or in a grid. Curved dividers may be used, although this is atypical. Activity diagrams can also be written without a partition. The construct is organizational, and doesn’t carry inherent semantics. The meaning is suggested by your choice of subset names.

Control begins in the initial state, represented by the solid dot in the upper-left corner of Figure 3.12. Control flows to the first activity, where the customer requests a quote (Request quote). Control remains in an activity until that activity is completed; then the control follows the outgoing arrow. When the request for a quote is complete, the Manufacturer generates a quote (Generate quote). Then the Customer reviews the quote (Review quote).

The next construct is a branch, represented by a diamond. Each outgoing arrow from a branch has a guard. The guard represents a condition that must be true in order for control to flow along that path. Guards are written as short condition descriptions enclosed in brackets. After the Customer finishes reviewing the quote in Figure 3.12, if it is unacceptable the process reaches a final state and terminates. A final state is represented with a target (the bull’s-eye). If the quote is acceptable, then the Customer places an order (Place order). The Manufacturer enters (Enter order), produces (Produce order), and ships the order (Ship order).

At a fork, control splits into multiple concurrent threads. The notation is a solid bar with one incoming arrow and multiple outgoing arrows. After the order ships in Figure 3.12, control reaches a fork and splits into two threads. The Customer receives the order (Receive order). In parallel to the Customer receiving the order, the Manufacturer generates an invoice (Generate invoice), and then the Customer receives the invoice (Receive invoice). The order of activities between threads is not constrained. Thus, the Customer may receive the order before or after the Manufacturer generates the invoice, or even after the Customer receives the invoice.

Figure 3.12 UML activity diagram, manufacturing example: Customer and Manufacturer swim lanes covering Request quote, Generate quote, Review quote with an [acceptable]/[unacceptable] branch, Place order, Enter order, Produce order, Ship order, a fork to Receive order and to Generate invoice and Receive invoice, a join, and finally Pay and Record payment.

At a join, multiple threads merge into a single thread. The notation is a solid bar with multiple incoming arrows and one outgoing arrow. In Figure 3.12, after the Customer receives the order and the invoice, the Customer will pay (Pay). All incoming threads must complete before control continues along the outgoing arrow.

Finally, in Figure 3.12, the Customer pays, the Manufacturer records the payment (Record payment), and then a final state is reached. Notice that an activity diagram may have multiple final states. However, there can only be one initial state.

There are at least two uses for activity diagrams in the context of database design. Activity diagrams can specify the interactions of classes in a database schema. Class diagrams capture structure; activity diagrams capture behavior. The two types of diagrams can present complementary aspects of the same system. For example, one can easily imagine that Figure 3.12 illustrates the usage of classes named Quote, Order, Invoice, and Payment. Another use for activity diagrams in the context of database design is to illustrate processes surrounding the database. For example, database life cycles can be illustrated using activity diagrams.

3.3 Rules of Thumb for UML Usage

1. Decide what you wish to communicate first, and then focus your description. Illustrate the details that further your purpose, and elide the rest. UML is like any other language in that you can immerse yourself in excruciating detail and lose your main purpose. Be concise.

2. Keep each UML diagram to one page. Diagrams are easier to understand if they can be seen in one glance. This is not to say that you must restrict yourself; rather, you should divide and organize your content into reasonable, understandable portions. Use packages to organize your presentation. If you have many brilliant ideas to convey (of course you do!), begin with a high-level diagram that paints the broad picture. Then follow up with a diagram dedicated to each of your ideas.

3. Use UML when it is useful. Don’t feel compelled to write a UML document just because you feel you need a UML document. UML is not an end in itself, but it is an excellent design tool for appropriate problems.

4. Accompany your diagrams with textual descriptions, thereby clarifying your intent. Additionally, remember that some people are oriented verbally, others visually. Combining natural language with UML is effective.

5. Take care to clearly organize each diagram. Avoid crossing associations. Group elements together if there is a connection in your mind. Two UML diagrams can contain the exact same elements and associations, and one might be a jumbled mess, while the other is elegant and clear. Both convey the same meaning in UML, but clearly the elegant version will be more successful at communicating design issues.

3.4 Summary

The Unified Modeling Language (UML) is a graphical language that is currently very popular for communicating design specifications for software, and in particular for logical database designs via class diagrams. The similarity between UML and the ER model is shown through some common examples, including ternary relationships and generalization. UML activity diagrams are used to specify the activities and flow of control in processes. Use of UML in logical database design is summarized with five basic rules of thumb.

3.5 Literature Summary

The definitive reference manual for UML is Rumbaugh, Jacobson, and Booch [2005]. Use Mullins [1999] for more detailed UML database modeling. Other useful UML texts are Naiburg and Maksimchuk [2001], Quatrani [2002], and Rumbaugh, Jacobson, and Booch [2004].


4 Requirements Analysis and Conceptual Data Modeling

This chapter shows how the ER and UML approaches can be applied to the database life cycle, particularly in steps I through II(b) (as defined in Section 1.2), which include the requirements analysis and conceptual data modeling stages of logical database design. The example introduced in Chapter 2 is used again to illustrate the ER modeling principles developed in this chapter.

4.1 Introduction

Logical database design is accomplished with a variety of approaches, including the top-down, bottom-up, and combined methodologies. The traditional approach, particularly for relational databases, has been a low-level, bottom-up activity, synthesizing individual data elements into normalized tables after carefully analyzing the data element interdependencies defined during the requirements analysis. Although the traditional process has been somewhat successful for small- to medium-sized databases, when used for large databases its complexity can be overwhelming to the point where practicing designers do not bother to use it with any regularity. In practice, a combination of the top-down and bottom-up approaches is used; in most cases, tables can be defined directly from the requirements analysis.

The conceptual data model has been most successful as a tool for communication between the designer and the end user during the requirements analysis and logical design phases. Its success is due to the fact that the model, using either ER or UML, is easy to understand and convenient to represent. Another reason for its effectiveness is that it is a top-down approach using the concept of abstraction. The number of entities in a database is typically far fewer than the number of individual data elements, because data elements usually represent the attributes. Therefore, using entities as an abstraction for data elements and focusing on the relationships between entities greatly reduces the number of objects under consideration and simplifies the analysis. Though it is still necessary to represent data elements by attributes of entities at the conceptual level, their dependencies are normally confined to the other attributes within the entity or, in some cases, to attributes associated with other entities with a direct relationship to their entity.

The major interattribute dependencies that occur in data models are the dependencies between the entity keys, the unique identifiers of different entities that are captured in the conceptual data modeling process. Special cases, such as dependencies among data elements of unrelated entities, can be handled when they are identified in the ensuing data analysis.

The logical database design approach defined here uses both the conceptual data model and the relational model in successive stages. It benefits from the simplicity and ease of use of the conceptual data model and the structure and associated formalism of the relational model. To facilitate this approach, it is necessary to build a framework for transforming the variety of conceptual data model constructs into tables that are already normalized or that can be normalized with a minimum of transformation. The beauty of this type of transformation is that it results in normalized or nearly normalized SQL tables from the start; frequently, further normalization is not necessary.

Before we do this, however, we need to first define the major steps of the relational logical design methodology in the context of the database life cycle.

4.2 Requirements Analysis

Step I, requirements analysis, is an extremely important step in the database life cycle and is typically the most labor intensive. The database designer must interview the end user population and determine exactly what the database is to be used for and what it must contain. The basic objectives of requirements analysis are:


• To delineate the data requirements of the enterprise in terms of basic data elements

• To describe the information about the data elements and the relationships among them needed to model these data requirements

• To determine the types of transactions that are intended to be executed on the database and the interaction between the transactions and the data elements

• To define any performance, integrity, security, or administrative constraints that must be imposed on the resulting database

• To specify any design and implementation constraints, such as specific technologies, hardware and software, programming languages, policies, standards, or external interfaces

• To thoroughly document all of the preceding in a detailed requirements specification. The data elements can also be defined in a data dictionary system, often provided as an integral part of the database management system.

The conceptual data model helps designers accurately capture the real data requirements because it requires them to focus on semantic detail in the data relationships, which is greater than the detail that would be provided by FDs alone. The semantics of the ER model, for instance, allow for direct transformations of entities and relationships to at least first normal form (1NF) tables. They also provide clear guidelines for integrity constraints. In addition, abstraction techniques such as generalization provide useful tools for integrating end user views to define a global conceptual schema.

4.3 Conceptual Data Modeling

Let us now look more closely at the basic data elements and relationships that should be defined during requirements analysis and conceptual design. These two life cycle steps are often done simultaneously.

Consider the substeps in step II(a), conceptual data modeling, using the ER model:

• Classify entities and attributes (classify classes and attributes in UML)

• Identify the generalization hierarchies (for both the ER model and UML)


• Define relationships (define associations and association classes in UML)

The remainder of this section discusses the tasks involved in each substep.

4.3.1 Classify Entities and Attributes

Though it is easy to define entity, attribute, and relationship constructs, it is not as easy to distinguish their roles in modeling the database. What makes a data element an entity, an attribute, or even a relationship? For example, project headquarters are located in cities. Should “city” be an entity or an attribute? A vita is kept for each employee. Is “vita” an entity or a relationship?

The following guidelines for classifying entities and attributes will help the designer’s thoughts converge to a normalized relational database design:

• Entities should contain descriptive information.

• Multivalued attributes should be classified as entities.

• Attributes should be attached to the entities they most directly describe.

Now we examine each guideline in turn.

Entity Contents

Entities should contain descriptive information. If there is descriptive information about a data element, the data element should be classified as an entity. If a data element requires only an identifier and does not have relationships, it should be classified as an attribute. With “city,” for example, if there is some descriptive information such as “country” and “population” for cities, then “city” should be classified as an entity. If only the city name is needed to identify a city, then “city” should be classified as an attribute associated with some entity, such as Project. The exception to this rule is that if the identity of the value needs to be constrained by set membership, you should create it as an entity. For example, “State” is much the same as city, but you probably want to have a State entity that contains all the valid State instances. Examples of other data elements in the real world that are typically classified as entities include Employee, Task, Project, Department, Company, Customer, and so on.
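To make the guideline concrete, here is a minimal SQL sketch of the two choices for “city”; the table and column names are illustrative assumptions, not taken from the book’s schemas.

-- Choice 1: "city" has descriptive information of its own
-- (country, population), so it becomes an entity and, later,
-- its own table referenced by project.
create table city
    (city_name char(20),
     country char(20),
     population integer,
     primary key (city_name));

create table project
    (project_name char(20),
     city_name char(20),
     primary key (project_name),
     foreign key (city_name) references city);

-- Choice 2: only the name is needed, so "city" would instead
-- remain a plain descriptive attribute of project:
--   create table project
--       (project_name char(20),
--        city_name char(20),   -- attribute; no city table at all
--        primary key (project_name));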

Multivalued Attributes

Classify multivalued attributes as entities. If more than one value of a descriptor attribute corresponds to one value of an identifier, the descriptor should be classified as an entity instead of an attribute, even though it does not have descriptors itself. A large company, for example, could have many divisions, some of them possibly in different cities. In that case, “division” could be classified as a multivalued attribute of “company,” but it would be better classified as an entity, with “division-address” as its identifier. If attributes are restricted to be single valued only, the later design and implementation decisions will be simplified.
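As a rough SQL sketch of where this guideline leads (names and types are assumed for illustration), the multivalued “division” descriptor becomes its own table, each row holding one value and a reference back to its company:

create table company
    (company_name char(20),
     primary key (company_name));

-- One row per division; a single-valued attribute on company
-- could not hold many divisions.
create table division
    (division_address varchar(256),
     company_name char(20) not null,
     primary key (division_address),
     foreign key (company_name) references company);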

Attribute Attachment

Attach attributes to the entities they most directly describe. For example, “office-building-name” should normally be an attribute of the entity Department, rather than the entity Employee. The procedure of identifying entities and attaching attributes to entities is iterative. Classify some data elements as entities and attach identifiers and descriptors to them. If you find some violation of the preceding guidelines, change some data elements from entity to attribute (or from attribute to entity), attach attributes to the new entities, and so forth.

4.3.2 Identify the Generalization Hierarchies

If there is a generalization hierarchy among entities, then put the identifier and generic descriptors in the supertype entity and put the same identifier and specific descriptors in the subtype entities.

For example, suppose five entities were identified in the ER model shown in Figure 2.4a:

• Employee, with identifier empno and descriptors empname, address, and date-of-birth

• Manager, with identifier empno and descriptors empname and jobtitle

• Engineer, with identifier empno and descriptors empname, highest-degree, and jobtitle

• Technician, with identifier empno and descriptors empname and specialty

• Secretary, with identifier empno and descriptors empname and best-skill

Let’s say we determine, through our analysis, that the entity Employee could be created as a generalization of Manager, Engineer, Technician, and Secretary. Then we put identifier empno and generic descriptors empname, address, and date-of-birth in the supertype entity Employee; identifier empno and specific descriptor jobtitle in the subtype entity Manager; identifier empno and specific descriptors highest-degree and jobtitle in the subtype entity Engineer; etc. Later, if we decide to eliminate Employee as a table, the original identifiers and generic attributes can be redistributed to all the subtype tables. (Note that we put table names in boldface throughout the book for readability.)
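As a preview of the formal transformation rules in Chapter 5, here is a minimal SQL sketch of how this supertype/subtype split can be carried into tables, assuming Employee is retained; the column types are assumptions, and Technician and Secretary would follow the same pattern.

create table employee
    (empno integer,
     empname char(20),
     address varchar(256),
     date_of_birth date,
     primary key (empno));

-- Each subtype repeats the identifier and adds only its
-- specific descriptors.
create table manager
    (empno integer,
     jobtitle varchar(256),
     primary key (empno),
     foreign key (empno) references employee);

create table engineer
    (empno integer,
     highest_degree char(10),
     jobtitle varchar(256),
     primary key (empno),
     foreign key (empno) references employee);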

4.3.3 Define Relationships

We now deal with data elements that represent associations among entities, which we call relationships. Examples of typical relationships are “works-in,” “works-for,” “purchases,” “drives,” or any verb that connects entities. For every relationship, the following should be specified: degree (binary, ternary, etc.); connectivity (one-to-many, etc.); optional or mandatory existence; and any attributes associated with the relationship and not the entities. The following are some guidelines for defining the more difficult types of relationships.

Redundant Relationships

Analyze redundant relationships carefully. Two or more relationships that are used to represent the same concept are considered redundant. Redundant relationships are more likely to result in unnormalized tables when transforming the ER model into relational schemas. Note that two or more relationships are allowed between the same two entities, as long as those relationships have different meanings. In this case they are not considered redundant. One important case of nonredundancy is shown in Figure 4.1a for the ER model and Figure 4.1c for UML. If “belongs-to” is a one-to-many relationship between Employee and Professional-association, if “located-in” is a one-to-many relationship between Professional-association and City, and if “lives-in” is a one-to-many relationship between Employee and City, then “lives-in” is not redundant, because the relationships are unrelated. However, consider the situation shown in Figure 4.1b for the ER model and Figure 4.1d for UML. The employee works on a project located in a city, so the “works-in” relationship between Employee and City is redundant and can be eliminated.

Figure 4.1 Redundant relationships: (a) nonredundant relationships; (b) redundant relationships using transitivity; (c) nonredundant associations; (d) redundant associations using transitivity. [ER and UML diagrams of Employee, Professional-association, Project, and City]

Ternary Relationships

Define ternary relationships carefully. We define a ternary relationship among three entities only when the concept cannot be represented by several binary relationships among those entities. For example, let us assume there is some association among entities Technician, Project, and Notebook. If each technician can be working on any of several projects and using the same notebooks on each project, then three many-to-many binary relationships can be defined (see Figure 4.2a for the ER model and Figure 4.2c for UML). If, however, each technician is constrained to use exactly one notebook for each project and that notebook belongs to only one technician, then a one-to-one-to-one ternary relationship should be defined (see Figure 4.2b for the ER model and Figure 4.2d for UML). The approach to take in ER modeling is to first attempt to express the associations in terms of binary relationships; if this is impossible because of the constraints of the associations, try to express them in terms of a ternary.

Figure 4.2 Ternary relationships: (a) binary relationships; (b) different meaning using a ternary relationship; (c) binary associations; (d) different meaning using a ternary association. [ER and UML diagrams of Technician, Project, and Notebook]

The meaning of connectivity for ternary relationships is important. Figure 4.2b shows that for a given pair of instances of Technician and Project, there is only one corresponding instance of Notebook; for a given pair of instances of Technician and Notebook, there is only one corresponding instance of Project; and for a given pair of instances of Project and Notebook, there is only one instance of Technician. In general, we know by our definition of ternary relationships that if a relationship among three entities can only be expressed by a functional dependency involving the keys of all three entities, then it cannot be expressed using only binary relationships, which only apply to associations between two entities. Object-oriented design provides arguably a better way to model this situation [Muller, 1999].
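One way to see these three functional dependencies is in a relational sketch of the one-to-one-to-one ternary relationship; this is a hypothetical rendering with assumed column names (Chapter 5 gives the book’s actual transformation rules). Each pair of keys determines the third, so each pair is a candidate key:

-- "uses-notebook": every pair of columns functionally
-- determines the third.
create table uses_notebook
    (emp_id char(10),                       -- Technician
     project_name char(20),
     notebook_no integer,
     primary key (emp_id, project_name),    -- emp + project -> notebook
     unique (emp_id, notebook_no),          -- emp + notebook -> project
     unique (project_name, notebook_no));   -- project + notebook -> emp
-- Foreign keys to the technician, project, and notebook
-- tables are omitted here for brevity.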

4.3.4 Example of Data Modeling: Company Personnel and Project Database

ER Modeling of Individual Views Based on Requirements

Let us suppose it is desirable to build a company-wide database for a large engineering firm that keeps track of all full-time personnel, their skills and projects assigned, the departments (and divisions) worked in, the engineer professional associations belonged to, and the engineer desktop computers allocated. During the requirements collection process—that is, interviewing the end users—we obtain three views of the database.

The first view, a management view, defines each employee as working in a single department, and defines a division as the basic unit in the company, consisting of many departments. Each division and department has a manager, and we want to keep track of each manager. The ER model for this view is shown in Figure 4.3a.

The second view defines each employee as having a job title: engineer, technician, secretary, manager, and so on. Engineers typically belong to professional associations and might be allocated an engineering workstation (or computer). Secretaries and managers are each allocated a desktop computer. A pool of desktops and workstations is maintained for potential allocation to new employees and for loans while an employee’s computer is being repaired. Any employee may be married to another employee, and we want to keep track of these relationships to avoid assigning an employee to be managed by his or her spouse. This view is illustrated in Figure 4.3b.

The third view, shown in Figure 4.3c, involves the assignment of employees, mainly engineers and technicians, to projects. Employees may work on several projects at one time, and each project could be headquartered at different locations (cities). However, each employee at a given location works on only one project at that location. Employee skills can be individually selected for a given project, but no individual has a monopoly on skills, projects, or locations.

Figure 4.3 Example of data modeling: (a) management view. [ER diagram of Employee, Department, and Division with the relationships is-managed-by, is-headed-by, has, and contains]


Figure 4.3 (continued): (b) employee view [ER diagram: Employee generalized over Manager, Secretary, Engineer, and Technician, with the recursive relationships is-married-to and manages, and allocations of Desktop, Workstation, and Prof-assoc]; (c) employee assignment view [ER diagram: ternary relationships skill-used and assigned-to among Employee, Project, Location, and Skill]


Global ER Schema

A simple integration of the three views over the entity Employee results in the global ER schema (diagram) in Figure 4.3d, which becomes the basis for developing the normalized tables. Each relationship in the global schema is based upon a verifiable assertion about the actual data in the enterprise, and analysis of those assertions leads to the transformation of these ER constructs into candidate SQL tables, as Chapter 5 shows.

Note that equivalent views and integration could be done for a UML conceptual model over the class Employee. We will use the ER model for the examples in the rest of this chapter, however.

The diagram shows examples of binary, ternary, and binary recursive relationships; optional and mandatory existence in relationships; and generalization with the disjointness constraint. Ternary relationships “skill-used” and “assigned-to” are necessary, because binary relationships cannot be used for the equivalent notions. For example, one employee and one location determine exactly one project (a functional dependency). In the case of “skill-used,” selective use of skills to projects cannot be represented with binary relationships (see Section 6.5).
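For instance, the functional dependency behind “assigned-to” could be captured in a table whose primary key is the (employee, location) pair; this is a sketch under assumed column names, not the book’s own listing:

create table assigned_to
    (emp_id char(10),
     loc_name char(15),
     project_name char(20) not null,
     -- one employee at one location works on exactly one project
     primary key (emp_id, loc_name));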

The use of optional existence, for instance, between Employee and Division or between Employee and Department, is derived from our general knowledge that most employees will not be managers of any division or department. In another example of optional existence, we show that the allocation of a workstation to an engineer may not always occur, nor will all desktops or workstations necessarily be allocated to someone at all times. In general, all relationships, optional existence constraints, and generalization constructs need to be verified with the end user before the ER model is transformed to SQL tables.

In summary, the application of the ER model to relational database design offers the following benefits:

• Use of an ER approach focuses end users’ discussions on important relationships between entities. Some applications are characterized by counterexamples affecting a small number of instances, and lengthy consideration of these instances can divert attention from basic relationships.

• A diagrammatic syntax conveys a great deal of information in a compact, readily understandable form.

• Extensions to the original ER model, such as optional and mandatory membership classes, are important in many relationships. Generalization allows entities to be grouped for one functional role or to be seen as separate subtypes when other constraints are imposed.

• A complete set of rules transforms ER constructs into mostly normalized SQL tables, which follow easily from real-world requirements.

Figure 4.3 (continued): (d) global ER schema integrating the management, employee, and employee assignment views. [ER diagram]

4.4 View Integration

A critical part of the database design process is step II(b), the integration of different user views into a unified, nonredundant global schema. The individual end-user views are represented by conceptual data models, and the integrated conceptual schema results from sufficient analysis of the end-user views to resolve all differences in perspective and terminology. Experience has shown that nearly every situation can be resolved in a meaningful way through integration techniques.

Schema diversity occurs when different users or user groups develop their own unique perspectives of the world or, at least, of the enterprise to be represented in the database. For instance, the marketing division tends to have the whole product as a basic unit for sales, but the engineering division may concentrate on the individual parts of the whole product. In another case, one user may view a project in terms of its goals and progress toward meeting those goals over time, but another user may view a project in terms of the resources it needs and the personnel involved. Such differences cause the conceptual models to seem to have incompatible relationships and terminology. These differences show up in conceptual data models as different levels of abstraction, connectivity of relationships (one-to-many, many-to-many, and so on), or as the same concept being modeled as an entity, attribute, or relationship, depending on the user’s perspective.

As an example of the latter case, in Figure 4.4 we see three different perspectives of the same real-life situation—the placement of an order for a certain product. The result is a variety of schemas. The first schema (Figure 4.4a) depicts Customer, Order, and Product as entities and “places” and “for-a” as relationships. The second schema (Figure 4.4b), however, defines “orders” as a relationship between Customer and Product and omits Order as an entity altogether. Finally, in the third case (Figure 4.4c), the relationship “orders” has been replaced by another relationship, “purchases”; “order-no,” the identifier (key) of an order, is designated as an attribute of the relationship “purchases.” In other words, the concept of order has been variously represented as an entity, a relationship, and an attribute, depending on perspective.

Figure 4.4 Schemas: placement of an order: (a) the concept of order as an entity; (b) the concept of order as a relationship; (c) the concept of order as an attribute. [ER diagrams of Customer, Order, and Product]
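A hypothetical SQL rendering of the three perspectives in Figure 4.4 makes the contrast visible; the suffixes a, b, and c distinguish the variants, and all column names are assumptions:

-- (a) Order as an entity: its own table with its own key.
create table order_a
    (order_no integer,
     cust_no integer not null,
     prod_no integer not null,
     primary key (order_no));

-- (b) Order as a relationship: just a pairing of customer and
-- product, with no order identifier at all.
create table orders_b
    (cust_no integer,
     prod_no integer,
     primary key (cust_no, prod_no));

-- (c) Order as an attribute: order_no rides along on the
-- "purchases" relationship.
create table purchases_c
    (cust_no integer,
     prod_no integer,
     order_no integer,
     primary key (cust_no, prod_no));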

There are four basic steps needed for conceptual schema integration:

1. Preintegration analysis

2. Comparison of schemas

3. Conformation of schemas

4. Merging and restructuring of schemas

4.4.1 Preintegration Analysis

The first step, preintegration analysis, involves choosing an integration strategy. Typically, the choice is between a binary approach with two schemas merged at one time and an n-ary approach with n schemas merged at one time, where n is between 2 and the total number of schemas developed in the conceptual design. The binary approach is attractive because each merge involves a small number of data model constructs and is easier to conceptualize. The n-ary approach may require only one grand merge, but the number of constructs may be so large that it is not humanly possible to organize the transformations properly.

4.4.2 Comparison of Schemas

In the second step, comparison of schemas, the designer looks at how entities correspond and detects conflicts arising from schema diversity—that is, from user groups adopting different viewpoints in their respective schemas. Naming conflicts include synonyms and homonyms. Synonyms occur when different names are given for the same concept; these can be detected by scanning the data dictionary, if one has been established for the database. Homonyms occur when the same name is used for different concepts. These can only be detected by scanning the different schemas and looking for common names.

Structural conflicts occur in the schema structure itself. Type conflicts involve using different constructs to model the same concept. In Figure 4.4, for example, an entity, a relationship, or an attribute can be used to model the concept of order in a business database. Dependency conflicts result when users specify different levels of connectivity (one-to-many, etc.) for similar or even the same concepts. One way to resolve such conflicts might be to use only the most general connectivity—for example, many-to-many. If that is not semantically correct, change the names of entities so that each type of connectivity has a different set of entity names. Key conflicts occur when different keys are assigned to the same entity in different views. For example, a key conflict occurs if an employee’s full name, employee ID number, and Social Security number are all assigned as keys.

4.4.3 Conformation of Schemas

The resolution of conflicts often requires user and designer interaction. The basic goal of the third step is to align or conform schemas to make them compatible for integration. The entities, as well as the key attributes, may need to be renamed. Conversion may be required so that concepts modeled as entities, attributes, or relationships are conformed to be only one of them. Relationships with equal degree, roles, and connectivity constraints are easy to merge. Those with differing characteristics are more difficult and, in some cases, impossible to merge. In addition, relationships that are not consistent—for example, a relationship using generalization in one place and the exclusive OR in another—must be resolved. Finally, assertions may need to be modified so that integrity constraints remain consistent.

Techniques used for view integration include abstraction, such as generalization and aggregation to create new supertypes or subtypes, or even the introduction of new relationships. As an example, the generalization of Individual over different values of the descriptor attribute “job-title” could represent the consolidation of two views of the database—one based on an individual as the basic unit of personnel in the organization, and another based on the classification of individuals by job titles and special characteristics within those classifications.

4.4.4 Merging and Restructuring of Schemas

The fourth step consists of the merging and restructuring of schemas. This step is driven by the goals of completeness, minimality, and understandability. Completeness requires all component concepts to appear semantically intact in the global schema. Minimality requires the designer to remove all redundant concepts in the global schema. Examples of redundant concepts are overlapping entities and truly semantically redundant relationships; for example, Ground-Vehicle and Automobile might be two overlapping entities. A redundant relationship might occur between Instructor and Student. The relationships “direct-research” and “advise” may or may not represent the same activity or relationship, so further investigation is required to determine whether they are redundant or not. Understandability requires that the global schema make sense to the user.

Component schemas are first merged by superimposing the same concepts and then restructuring the resulting integrated schema for understandability. For instance, if a supertype/subtype combination is defined as a result of the merging operation, the properties of the subtype can be dropped from the schema because they are automatically provided by the supertype entity.

4.4.5 Example of View Integration

Let us look at two different views of overlapping data. The views are based on two separate interviews of end users. We adapt the interesting example cited by Batini, Lenzerini, and Navathe [1986] to a hypothetical situation related to our example.

In Figure 4.5a we have a view that focuses on reports and includes data on departments that publish the reports, topic areas in reports, and contractors for whom the reports are written. Figure 4.5b shows another view, with publications as the central focus and keywords on publication as the secondary data. Our objective is to find meaningful ways to integrate the two views and maintain completeness, minimality, and understandability.

We first look for synonyms and homonyms, particularly among theentities. Note that a synonym exists between the entities Topic-area inschema 1 and Keyword in schema 2, even though the attributes do not

Figure 4.5 View integration: find meaningful ways to integrate

NN

N

N

N1 written-for

contains

title

address

(a) Original schema 1, focused on reports

name

name

ReportpublishesDepartment

Topic-area

PublicationN N

contains Keyword

title

code

title

code

research-areadept-name

(b) Original schema 2, focused on publications

name

Contractor

address

Teorey.book Page 70 Saturday, July 16, 2005 12:57 PM

Page 85: Database modeling and design logical design 4th edition data management systems

4.4 View Integration 71

match. However, we find that the attributes are compatible and can beconsolidated. This is shown in Figure 4.6a, which presents a revisedschema, schema 2.1. In schema 2.1 Keyword has been replaced by Topic-area.

Next we look for structural conflicts between schemas. A type conflict is found to exist between the entity Department in schema 1 and the attribute “dept-name” in schema 2.1. The conflict is resolved by keeping the stronger entity type, Department, and moving the attribute type “dept-name” under Publication in schema 2 to the new entity, Department, in schema 2.2 (see Figure 4.6b).

Figure 4.6 View integration: type conflict: (a) schema 2.1, in which Keyword has changed to Topic-area; (b) schema 2.2, in which the attribute dept-name has changed to an attribute and an entity. [ER diagrams of Publication, Topic-area, and Department]


At this point we have sufficient commonality between schemas to attempt a merge. In schemas 1 and 2.2 we have two sets of common entities, Department and Topic-area. Other entities do not overlap and must appear intact in the superimposed, or merged, schema. The merged schema, schema 3, is shown in Figure 4.7a. Because the common entities are truly equivalent, there are no bad side effects of the merge due to existing relationships involving those entities in one schema and not in the other. (Such a relationship that remains intact exists in schema 1 between Topic-area and Report, for example.) If true equivalence cannot be established, the merge may not be possible in the existing form.

In Figure 4.7, there is some redundancy between Publication and Report in terms of the relationships with Department and Topic-area. Such a redundancy can be eliminated if there is a supertype/subtype relationship between Publication and Report, which does in fact occur in this case because Publication is a generalization of Report. In schema 3.1 (Figure 4.7b) we see the introduction of this generalization from Report to Publication. Then in schema 3.2 (Figure 4.7c) we see that the redundant relationships between Report and Department and Topic-area have been dropped. The attribute “title” has been eliminated as an attribute of Report in Figure 4.7c because “title” already appears as an attribute of Publication at a higher level of abstraction; “title” is inherited by the subtype Report.

Figure 4.7 View integration: the merged schema: (a) schema 3, the result of merging schema 1 and schema 2.2; (b) schema 3.1, new generalization; (c) schema 3.2, elimination of redundant relationships. [ER diagrams of Publication, Report, Department, Topic-area, and Contractor]

The final schema, in Figure 4.7c, expresses completeness because all the original concepts (report, publication, topic area, department, and contractor) are kept intact. It expresses minimality because of the transformation of “dept-name” from attribute in schema 2.1 to entity and attribute in schema 2.2, and the merger between schema 1 and schema 2.2 to form schema 3, and because of the elimination of “title” as an attribute of Report and of Report relationships with Topic-area and Department. Finally, it expresses understandability in that the final schema actually has more meaning than the individual original schemas.

The view integration process is one of continual refinement and reevaluation. It should also be noted that minimality may not always be the most efficient way to proceed. If, for example, the elimination of the redundant relationships “publishes” and/or “contains” from schema 3.1 to 3.2 causes the time required to perform certain queries to be excessively long, it may be better from a performance viewpoint to leave them in. This decision could be made during the analysis of the transactions on the database or during the testing phase of the fully implemented database.

4.5 Entity Clustering for ER Models

This section presents the concept of entity clustering, which abstracts the ER schema to such a degree that the entire schema can appear on a single sheet of paper or a single computer screen. This has happy consequences for the end user and database designer in terms of developing a mutual understanding of the database contents and formally documenting the conceptual model.

An entity cluster is the result of a grouping operation on a collection of entities and relationships. Entity clustering is potentially useful for designing large databases. When the scale of a database or information structure is large and includes a large number of interconnections among its different components, it may be very difficult to understand the semantics of such a structure and to manage it, especially for the end users or managers. In an ER diagram with 1,000 entities, the overall structure will probably not be very clear, even to a well-trained database analyst. Clustering is therefore important because it provides a method to organize a conceptual database schema into layers of abstraction, and it supports the different views of a variety of end users.

4.5.1 Clustering Concepts

One should think of grouping as an operation that combines entities and their relationships to form a higher-level construct. The result of a grouping operation on simple entities is called an entity cluster. A grouping operation on entity clusters, or on combinations of elementary entities and entity clusters, results in a higher-level entity cluster. The highest-level entity cluster, representing the entire database conceptual schema, is called the root entity cluster.

Figure 4.8a illustrates the concept of entity clustering in a simple case where (elementary) entities R-sec (report section), R-abbr (report abbreviation), and Author are naturally bound to (dominated by) the entity Report; and entities Department, Contractor, and Project are not dominated. (Note that to avoid unnecessary detail, we do not include the attributes of entities in the diagrams.) In Figure 4.8b, the dark-bordered box around the entity Report and the entities it dominates defines the entity cluster Report. The dark-bordered box is called the EC box to represent the idea of an entity cluster. In general, the name of the entity cluster need not be the same as the name of any internal entity; however, when there is a single dominant entity, the names are often the same. The EC box number in the lower-right corner is a clustering-level number used to keep track of the sequence in which clustering is done. The number 2.1 signifies that the entity cluster Report is the first entity cluster at level 2. Note that all the original entities are considered to be at level 1.

The higher-level abstraction, the entity cluster, must maintain the same relationships between entities inside and outside the entity cluster as occur between the same entities in the lower-level diagram. Thus, the entity names inside the entity cluster should appear just outside the EC box along the path of their direct relationship to the appropriately related entities outside the box, maintaining consistent interfaces (relationships) as shown in Figure 4.8b. For simplicity, we modify this rule slightly: If the relationship is between an external entity and the dominant internal entity (for which the entity cluster is named), the entity cluster name need not be repeated outside the EC box. Thus, in Figure 4.8b, we could drop the name Report both places it occurs outside the Report box, but we must retain the name Author, which is not the name of the entity cluster.

Figure 4.8 Entity clustering concepts: (a) ER model before clustering; (b) ER model after clustering. [ER diagrams of Report, R-sec, R-abbr, Author, Department, Contractor, and Project, before and after forming the entity cluster Report at level 2.1]

4.5.2 Grouping Operations

Grouping operations are the fundamental components of the entity clustering technique. They define what collections of entities and relationships comprise higher-level objects, the entity clusters. The operations are heuristic in nature and include (see Figure 4.9):


• Dominance grouping

• Abstraction grouping

• Constraint grouping

• Relationship grouping

These grouping operations can be applied recursively or used in a variety of combinations to produce higher-level entity clusters, that is, clusters at any level of abstraction. An entity or entity cluster may be an object that is subject to combinations with other objects to form the next higher level. That is, entity clusters have the properties of entities and can have relationships with any other objects at any equal or lower level. The original relationships among entities are preserved after all grouping operations, as illustrated in Figure 4.8.

Figure 4.9 Grouping operations: (a) dominance grouping; (b) abstraction grouping; (c) constraint grouping; (d) relationship grouping. [diagrams]

Dominant objects or entities normally become obvious from the ER diagram or the relationship definitions. Each dominant object is grouped with all its related nondominant objects to form a cluster. Weak entities can be attached to an entity to make a cluster. Multilevel data objects using abstractions such as generalization and aggregation can be grouped into an entity cluster. The supertype or aggregate entity name is used as the entity cluster name. Constraint-related objects that extend the ER model to incorporate integrity constraints, such as the exclusive OR, can be grouped into an entity cluster. Additionally, ternary or higher degree relationships potentially can be grouped into an entity cluster. The cluster represents the relationship as a whole.

4.5.3 Clustering Technique

The grouping operations and their order of precedence determine the individual activities needed for clustering. We can now learn how to build a root entity cluster from the elementary entities and relationships defined in the ER modeling process. This technique assumes that a top-down analysis has been performed as part of the database requirement analysis and that the analysis has been documented so that the major functional areas and subareas are identified. Functional areas are often defined by an enterprise’s important organizational units, business activities, or, possibly, by dominant applications for processing information. As an example, recall Figure 4.3 (reconstructed in Figure 4.10), which can be thought of as having three major functional areas: company organization (division, department), project management (project, skill, location, employee), and employee data (employee, manager, secretary, engineer, technician, prof-assoc, and desktop). Note that the functional areas are allowed to overlap. Figure 4.10 uses an ER diagram resulting from the database requirement analysis to show how clustering involves a series of bottom-up steps using the basic grouping operations. The following list explains these steps.

Figure 4.10 ER diagram: clustering technique. [The global ER schema of Figure 4.3d with boundaries drawn around the company organization, project management, and employee data functional areas]

1. Define points of grouping within functional areas. Locate the dominant entities in a functional area through natural relationships, local n-ary relationships, integrity constraints, abstractions, or just the central focus of many simple relationships. If such points of grouping do not exist within an area, consider a functional grouping of a whole area.

2. Form entity clusters. Use the basic grouping operations on elementary entities and their relationships to form higher-level objects, or entity clusters. Because entities may belong to several potential clusters, we need to have a set of priorities for forming entity clusters. The following set of rules, listed in priority order, defines the set that is most likely to preserve the clarity of the conceptual model:

a. Entities to be grouped into an entity cluster should exist within the same functional area; that is, the entire entity cluster should occur within the boundary of a functional area. For example, in Figure 4.10, the relationship between Department and Employee should not be clustered unless Employee is included in the company organization functional area with Department and Division. In another example, the relationship between the supertype Employee and its subtypes could be clustered within the employee data functional area.

b. If a conflict in choice between two or more potential entity clusters cannot be resolved (e.g., between two constraint groupings at the same level of precedence), leave these entity clusters ungrouped within their functional area. If that functional area remains cluttered with unresolved choices, define functional subareas in which to group unresolved entities, entity clusters, and their relationships.

3. Form higher-level entity clusters. Apply the grouping operations recursively to any combination of elementary entities and entity clusters to form new levels of entity clusters (higher-level objects). Resolve conflicts using the same set of priority rules given in step 2. Continue the grouping operations until all the entity representations fit on a single page without undue complexity. The root entity cluster is then defined.

4. Validate the cluster diagram. Check for consistency of the interfaces (relationships) between objects at each level of the diagram. Verify the meaning of each level with the end users.

The result of one round of clustering is shown in Figure 4.11, where each of the clusters is shown at level 2.

Figure 4.11 Clustering results. [The schema of Figure 4.10 after one round of clustering: a division/department cluster, a project management cluster, and manager, secretary, and engineer clusters, each at level 2]


4.6 Summary

Conceptual data modeling, using either the ER or UML approach, is particularly useful in the early steps of the database life cycle, which involve requirements analysis and logical design. These two steps are often done simultaneously, particularly when requirements are determined from interviews with end users and modeled in terms of data-to-data relationships and process-to-data relationships. The conceptual data modeling step (ER approach) involves the classification of entities and attributes first, then the identification of generalization hierarchies and other abstractions, and finally the definition of all relationships among entities. Relationships may be binary (the most common), ternary, and higher-level n-ary. Data modeling of individual requirements typically involves creating a different view for each end user’s requirements. Then the designer must integrate those views into a global schema, so that the entire database is pictured as an integrated whole. This helps to eliminate needless redundancy—such elimination is particularly important in logical design. Controlled redundancy can be created later, at the physical design level, to enhance database performance. Finally, an entity cluster is a grouping of entities and their corresponding relationships into a higher-level abstract object. Clustering promotes the simplicity that is vital for fast end-user comprehension. In the next chapter we take the global schema produced from the conceptual data modeling and view integration steps, and we transform it into SQL tables. The SQL format is the end product of logical design, which is still independent of any particular database management system.

4.7 Literature Summary

Conceptual data modeling is defined in Tsichritzis and Lochovsky [1982], Brodie, Mylopoulos, and Schmidt [1984], Nijssen and Halpin [1989], and Batini, Ceri, and Navathe [1992]. Discussion of the requirements data collection process can be found in Martin [1982], Teorey and Fry [1982], and Yao [1985]. View integration has progressed from a representation tool [Smith and Smith, 1977] to heuristic algorithms [Batini, Lenzerini, and Navathe, 1986; Elmasri and Navathe, 2003]. These algorithms are typically interactive, allowing the database designer to make decisions based on suggested alternative integration actions. A variety of entity clustering models have been defined that provide a useful foundation for the clustering technique shown here [Feldman and Miller, 1986; Dittrich, Gotthard, and Lockemann, 1986; Teorey et al., 1989].


5 Transforming the Conceptual Data Model to SQL

This chapter focuses on the database life cycle step that is of particular interest when designing relational databases: transformation of the conceptual data model to candidate tables and their definition in SQL [step II(c)]. There is a natural evolution from the ER and UML data models to a relational schema. The evolution is so natural, in fact, that it supports the contention that conceptual data modeling is an effective early step in relational database development. This contention has been proven to some extent by the widespread commercialization and use of software design tools that support not only conceptual data modeling but also the automatic conversion of these models to vendor-specific SQL table definitions and integrity constraints.

In this chapter we assume the applications to be Online Transaction Processing (OLTP). Note that Online Analytical Processing (OLAP) applications are the subject of Chapter 8.

5.1 Transformation Rules and SQL Constructs

Let’s first look at the ER and UML modeling constructs in detail to see how the rules about transforming the conceptual data model to SQL tables are defined and applied. Our example is drawn from the company personnel and project conceptual schemas illustrated in Figure 4.3 (see Chapter 4).


The basic transformations can be described in terms of the three types of tables they produce:

• SQL table with the same information content as the original entity from which it is derived. This transformation always occurs for entities with binary relationships (associations) that are many-to-many, one-to-many on the “one” (parent) side, or one-to-one on either side; entities with binary recursive relationships that are many-to-many; and entities with any ternary or higher-degree relationship or a generalization hierarchy.

• SQL table with the embedded foreign key of the parent entity. This transformation always occurs for entities with binary relationships that are one-to-many for the entity on the “many” (child) side, for one-to-one relationships for one of the entities, and for each entity with a binary recursive relationship that is one-to-one or one-to-many. This is one of the two most common ways design tools handle relationships, by prompting the user to define a foreign key in the child table that matches a primary key in the parent table.

• SQL table derived from a relationship, containing the foreign keys of all the entities in the relationship. This transformation always occurs for relationships that are binary and many-to-many, relationships that are binary recursive and many-to-many, and all relationships that are of ternary or higher degree. This is the other most common way design tools handle relationships in the ER and UML models. A many-to-many relationship can only be defined in terms of a table that contains foreign keys that match the primary keys of the two associated entities. This new table may also contain attributes of the original relationship—for example, a relationship “enrolled-in” between two entities Student and Course might have the attributes “term” and “grade,” which are associated with a particular enrollment of a student in a particular course (see the sketch following this list).
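The following is a minimal sketch of this third kind of table for the “enrolled-in” example just mentioned; the column types, and the choice to include term in the primary key, are assumptions rather than the book’s own listing.

create table student
    (student_id integer,
     primary key (student_id));

create table course
    (course_no integer,
     primary key (course_no));

-- Table derived from the many-to-many relationship: foreign keys
-- of both entities plus the relationship's own attributes.
create table enrolled_in
    (student_id integer,
     course_no integer,
     term char(10),
     grade char(2),
     primary key (student_id, course_no, term),
     foreign key (student_id) references student
       on delete cascade on update cascade,
     foreign key (course_no) references course
       on delete cascade on update cascade);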

The following rules apply to handling SQL null values in these transformations:

• Nulls are allowed in an SQL table for foreign keys of associated (referenced) optional entities.

• Nulls are not allowed in an SQL table for foreign keys of associated (referenced) mandatory entities.


• Nulls are not allowed for any key in an SQL table derived from a many-to-many relationship, because only complete row entries are meaningful in the table.

Figures 5.1 through 5.8 show the SQL create table statements that can be derived from each type of ER or UML model construct. Note that table names are shown in boldface for readability. Note also that in each SQL table definition, the term “primary key” represents the key of the table that is to be used for indexing and searching for data.

5.1.1 Binary Relationships

A one-to-one binary relationship between two entities is illustrated in Figure 5.1, parts a through c. Note that the UML equivalent binary association is given in Figure 5.2, parts a through c.

When both entities are mandatory (Figure 5.1a), each entity becomes a table, and the key of either entity can appear in the other entity’s table as a foreign key. One of the entities in an optional relationship (see Department in Figure 5.1b) should contain the foreign key of the other entity in its transformed table. Employee, the other entity in Figure 5.1b, could also contain a foreign key (dept_no) with nulls allowed, but this would require more storage space because of the much greater number of Employee entity instances than Department instances. When both entities are optional (Figure 5.1c), either entity can contain the embedded foreign key of the other entity, with nulls allowed in the foreign keys.

The one-to-many relationship can be shown as either mandatory or optional on the “many” side, without affecting the transformation. On the “one” side it may be either mandatory (Figure 5.1d) or optional (Figure 5.1e). In all cases the foreign key must appear on the “many” side, which represents the child entity, with nulls allowed for foreign keys only in the optional “one” case. Foreign key constraints are set according to the specific meaning of the relationship and may vary from one relationship to another.

The many-to-many relationship, shown in Figure 5.1f as optional for both entities, requires a new table containing the primary keys of both entities. The same transformation applies to either the optional or mandatory case, including the fact that the not null clause must appear for the foreign keys in both cases. Note also that an optional entity means that the SQL table derived from it may have zero rows for that particular relationship. This does not affect “null” or “not null” in the table definition.


Figure 5.1 ER model: one-to-one binary relationship between two entities

(a) One-to-one, both entities mandatory. Every report has one abbreviation, and every abbreviation represents exactly one report.

create table report
    (report_no integer,
     report_name varchar(256),
     primary key (report_no));

create table abbreviation
    (abbr_no char(6),
     report_no integer not null unique,
     primary key (abbr_no),
     foreign key (report_no) references report
       on delete cascade on update cascade);

(b) One-to-one, one entity optional, one mandatory. Every department must have a manager, but an employee can be a manager of at most one department.

create table employee
    (emp_id char(10),
     emp_name char(20),
     primary key (emp_id));

create table department
    (dept_no integer,
     dept_name char(20),
     mgr_id char(10) not null unique,
     primary key (dept_no),
     foreign key (mgr_id) references employee
       on delete set default on update cascade);

(c) One-to-one, both entities optional. Some desktop computers are allocated to engineers, but not necessarily to all engineers.

create table engineer
    (emp_id char(10),
     desktop_no integer,
     primary key (emp_id));

create table desktop
    (desktop_no integer,
     emp_id char(10),
     primary key (desktop_no),
     foreign key (emp_id) references engineer
       on delete set null on update cascade);


Figure 5.1 (continued)

(d) One-to-many, both entities mandatory. Every employee works in exactly one department, and each department has at least one employee.

create table department
    (dept_no integer,
     dept_name char(20),
     primary key (dept_no));

create table employee
    (emp_id char(10),
     emp_name char(20),
     dept_no integer not null,
     primary key (emp_id),
     foreign key (dept_no) references department
       on delete set default on update cascade);

(e) One-to-many, one entity optional, one mandatory. Each department publishes one or more reports. A given report may not necessarily be published by a department.

create table department
    (dept_no integer,
     dept_name char(20),
     primary key (dept_no));

create table report
    (report_no integer,
     dept_no integer,
     primary key (report_no),
     foreign key (dept_no) references department
       on delete set null on update cascade);

(f) Many-to-many, both entities optional. Every professional association could have none, one, or many engineer members. Each engineer could be a member of none, one, or many professional associations.

create table engineer
    (emp_id char(10),
     primary key (emp_id));

create table prof_assoc
    (assoc_name varchar(256),
     primary key (assoc_name));

create table belongs_to
    (emp_id char(10),
     assoc_name varchar(256),
     primary key (emp_id, assoc_name),
     foreign key (emp_id) references engineer
       on delete cascade on update cascade,
     foreign key (assoc_name) references prof_assoc
       on delete cascade on update cascade);


Figure 5.2 UML: one-to-one binary relationship between two entities

(a) One-to-one, both entities mandatory (multiplicities 1 and 1). Every report has one abbreviation, and every abbreviation represents exactly one report.

(b) One-to-one, one entity optional, one mandatory (multiplicities 0..1 and 1). Every department must have a manager, but an employee can be a manager of at most one department.

(c) One-to-one, both entities optional (multiplicities 0..1 and 0..1). Some desktop computers are allocated to engineers, but not necessarily to all engineers.

[UML class diagrams; the accompanying create table statements are identical to those in Figure 5.1, parts a through c.]


Figure 5.2 (continued)

(d) One-to-many, both entities mandatory (multiplicities 1 and *). Every employee works in exactly one department, and each department has at least one employee.

(e) One-to-many, one entity optional, one mandatory (multiplicities 0..1 and *). Each department publishes one or more reports. A given report may not necessarily be published by a department.

(f) Many-to-many, both entities optional. Every professional association could have none, one, or many engineer members. Each engineer could be a member of none, one, or many professional associations.

[UML class diagrams; the accompanying create table statements are identical to those in Figure 5.1, parts d through f.]


5.1.2 Binary Recursive Relationships

A single entity with a one-to-one relationship implies some form of entity occurrence pairing, as indicated by the relationship name. This pairing may be completely optional, completely mandatory, or neither.

Figure 5.3 ER model: binary recursive relationship

(a) One-to-one, both sides optional
[Diagram: Employee (1) --is-married-to-- (1) Employee]
Any employee is allowed to be married to another employee in this company.

create table employee (emp_id char(10), emp_name char(20), spouse_id char(10), primary key (emp_id), foreign key (spouse_id) references employee on delete set null on update cascade);

(b) One-to-many, one side mandatory, many side optional
[Diagram: Engineer (1) --is-group-leader-of-- (N) Engineer]
Engineers are divided into groups for certain projects. Each group has a leader.

create table engineer (emp_id char(10), leader_id char(10) not null, primary key (emp_id), foreign key (leader_id) references engineer on delete set default on update cascade);

(c) Many-to-many, both sides optional
[Diagram: Employee (N) --is-coauthor-with-- (N) Employee]
Each employee has the opportunity to coauthor a report with one or more other employees, or to write the report alone.

create table employee (emp_id char(10), emp_name char(20), primary key (emp_id));

create table coauthor (author_id char(10), coauthor_id char(10), primary key (author_id, coauthor_id), foreign key (author_id) references employee on delete cascade on update cascade, foreign key (coauthor_id) references employee on delete cascade on update cascade);


In all of these cases (Figure 5.3a for ER and Figure 5.4a for UML), the pairing entity key appears as a foreign key in the resulting table. The two key attributes are taken from the same domain but are given different names to designate their unique use. The one-to-many relationship requires a foreign key in the resulting table (Figure 5.3b). The foreign key constraints can vary with the particular relationship.

Figure 5.4 UML: binary recursive relationship

(a) one-to-one, both sides optional
[Diagram: Employee (0..1) --is-married-to-- (0..1) Employee]
Any employee is allowed to be married to another employee in this company.

create table employee (emp_id char(10), emp_name char(20), spouse_id char(10), primary key (emp_id), foreign key (spouse_id) references employee on delete set null on update cascade);

(b) one-to-many, one side mandatory, many side optional
[Diagram: Engineer (1) --is-group-leader-of / is-led-by-- (0..*) Engineer]
Engineers are divided into groups for certain projects. Each group has a leader.

create table engineer (emp_id char(10), leader_id char(10) not null, primary key (emp_id), foreign key (leader_id) references engineer on delete set default on update cascade);

(c) many-to-many, both sides optional
[Diagram: Employee (0..*) --is-coauthor-with-- (0..*) Employee]
Each employee has the opportunity to coauthor a report with one or more other employees, or to write the report alone.

create table employee (emp_id char(10), emp_name char(20), primary key (emp_id));

create table coauthor (author_id char(10), coauthor_id char(10), primary key (author_id, coauthor_id), foreign key (author_id) references employee on delete cascade on update cascade, foreign key (coauthor_id) references employee on delete cascade on update cascade);


The many-to-many relationship is shown as optional (Figure 5.3c) and results in a new table; it could also be defined as mandatory (using the word "must" instead of "may"); both cases have the foreign keys defined as "not null." In many-to-many relationships, foreign key constraints on delete and update must always be cascade, because each entry in the SQL table depends on the current value or existence of the referenced primary key.

5.1.3 Ternary and n-ary Relationships

An n-ary relationship has (n + 1) possible variations of connectivity: all n sides with connectivity "one;" (n – 1) sides with connectivity "one," and one side with connectivity "many;" (n – 2) sides with connectivity "one" and two sides with "many;" and so on until all sides are "many."

The four possible varieties of a ternary relationship are shown in Figure 5.5 for the ER model and Figure 5.6 for UML. All variations are transformed by creating an SQL table containing the primary keys of all entities; however, in each case the meaning of the keys is different. When all three relationships are "one" (Figure 5.5a), the resulting SQL table consists of three possible distinct keys. This arrangement represents the fact that three FDs are needed to describe this relationship. The optionality constraint is not used here because all n entities must participate in every instance of the relationship to satisfy the FD constraints. (See Chapter 6 for more discussion of functional dependencies.)

In general, the number of entities with connectivity "one" determines the lower bound on the number of FDs. Thus, in Figure 5.5b, which is one-to-one-to-many, there are two FDs; in Figure 5.5c, which is one-to-many-to-many, there is only one FD. When all relationships are "many" (Figure 5.5d), the relationship table is all one composite key, unless the relationship has its own attributes. In that case the key is the composite of all three keys from the three associated entities.

Foreign key constraints on delete and update for ternary relationships transformed to SQL tables must always be cascade, because each entry in the SQL table depends on the current value of, or existence of, the referenced primary key.


Figure 5.5 ER model: ternary and n-ary relationships

(a) One-to-one-to-one ternary relationship
[Diagram: Technician (1) -- Project (1) -- Notebook (1): uses-notebook]

A technician uses exactly one notebook for each project. Each notebook belongs to one technician for each project. Note that a technician may still work on many projects and maintain different notebooks for different projects.

uses_notebook
emp_id  project_name  notebook_no
35      alpha         5001
35      gamma         2008
42      delta         1004
42      epsilon       3005
81      gamma         1007
93      alpha         1009
93      beta          5001

Functional dependencies
emp_id, project_name → notebook_no
emp_id, notebook_no → project_name
project_name, notebook_no → emp_id

create table technician (emp_id char(10), primary key (emp_id));

create table project (project_name char(20), primary key (project_name));

create table notebook (notebook_no integer, primary key (notebook_no));

create table uses_notebook (emp_id char(10), project_name char(20), notebook_no integer not null, primary key (emp_id, project_name), foreign key (emp_id) references technician on delete cascade on update cascade, foreign key (project_name) references project on delete cascade on update cascade, foreign key (notebook_no) references notebook on delete cascade on update cascade, unique (emp_id, notebook_no), unique (project_name, notebook_no));


Figure 5.5 (continued)

(b) One-to-one-to-many ternary relationships
[Diagram: Employee (N) -- Project (1) -- Location (1): assigned-to]

Each employee assigned to a project works at only one location for that project, but can be at a different location for a different project. At a given location, an employee works on only one project. At a particular location, there can be many employees assigned to a given project.

assigned_to
emp_id  project_name  loc_name
48101   forest        B66
48101   ocean         E71
20702   ocean         A12
20702   river         D54
51266   river         G14
51266   ocean         A12
76323   hills         B66

Functional dependencies
emp_id, loc_name → project_name
emp_id, project_name → loc_name

create table employee (emp_id char(10), emp_name char(20), primary key (emp_id));

create table project (project_name char(20), primary key (project_name));

create table location (loc_name char(15), primary key (loc_name));

create table assigned_to (emp_id char(10), project_name char(20), loc_name char(15) not null, primary key (emp_id, project_name), foreign key (emp_id) references employee on delete cascade on update cascade, foreign key (project_name) references project on delete cascade on update cascade, foreign key (loc_name) references location on delete cascade on update cascade, unique (emp_id, loc_name));


Figure 5.5 (continued)

(c) One-to-many-to-many ternary relationships
[Diagram: Manager (1) -- Project (N) -- Engineer (N): manages]

Each engineer working on a particular project has exactly one manager, but a project can have many managers and an engineer may have many managers and many projects. A manager may manage several projects.

manages
emp_id  project_name  mgr_id
4106    alpha         27
4200    alpha         27
7033    beta          32
4200    beta          14
4106    gamma         71
7033    delta         55
4106    delta         39
4106    iota          27

Functional dependency
project_name, emp_id → mgr_id

create table project (project_name char(20), primary key (project_name));

create table manager (mgr_id char(10), primary key (mgr_id));

create table engineer (emp_id char(10), primary key (emp_id));

create table manages (project_name char(20), mgr_id char(10) not null, emp_id char(10), primary key (project_name, emp_id), foreign key (project_name) references project on delete cascade on update cascade, foreign key (mgr_id) references manager on delete cascade on update cascade, foreign key (emp_id) references engineer on delete cascade on update cascade);


Figure 5.5 (continued)

(d) Many-to-many-to-many ternary relationships
[Diagram: Employee (N) -- Skill (N) -- Project (N): skill-used]

Employees can use different skills on any one of many projects, and each project has many employees with various skills.

skill_used
emp_id  project_name  skill_type
101     electronics   algebra
101     electronics   calculus
101     mechanics     algebra
101     mechanics     geometry
102     electronics   algebra
102     electronics   set theory
102     mechanics     geometry
105     mechanics     topology

Functional dependencies
None

create table employee (emp_id char(10), emp_name char(20), primary key (emp_id));

create table skill (skill_type char(15), primary key (skill_type));

create table project (project_name char(20), primary key (project_name));

create table skill_used (emp_id char(10), skill_type char(15), project_name char(20), primary key (emp_id, skill_type, project_name), foreign key (emp_id) references employee on delete cascade on update cascade, foreign key (skill_type) references skill on delete cascade on update cascade, foreign key (project_name) references project on delete cascade on update cascade);


Figure 5.6 UML: ternary and n-ary relationships

(a) one-to-one-to-one ternary association
[Diagram: Technician (1) -- Project (1) -- Notebook (1): uses-notebook]

A technician uses exactly one notebook for each project. Each notebook belongs to one technician for each project. Note that a technician may still work on many projects and maintain different notebooks for different projects.

uses_notebook
emp_id  project_name  notebook_no
35      alpha         5001
35      gamma         2008
42      delta         1004
42      epsilon       3005
81      gamma         1007
93      alpha         1009
93      beta          5001

Functional dependencies
emp_id, project_name → notebook_no
emp_id, notebook_no → project_name
project_name, notebook_no → emp_id

create table technician (emp_id char(10), primary key (emp_id));

create table project (project_name char(20), primary key (project_name));

create table notebook (notebook_no integer, primary key (notebook_no));

create table uses_notebook (emp_id char(10), project_name char(20), notebook_no integer not null, primary key (emp_id, project_name), foreign key (emp_id) references technician on delete cascade on update cascade, foreign key (project_name) references project on delete cascade on update cascade, foreign key (notebook_no) references notebook on delete cascade on update cascade, unique (emp_id, notebook_no), unique (project_name, notebook_no));


Figure 5.6 (continued)

(b) one-to-one-to-many ternary associations
[Diagram: Employee (*) -- Project (1) -- Location (1): assigned-to]

Each employee assigned to a project works at only one location for that project, but can be at a different location for a different project. At a given location, an employee works on only one project. At a particular location there can be many employees assigned to a given project.

assigned_to
emp_id  project_name  loc_name
48101   forest        B66
48101   ocean         E71
20702   ocean         A12
20702   river         D54
51266   river         G14
51266   ocean         A12
76323   hills         B66

Functional dependencies
emp_id, loc_name → project_name
emp_id, project_name → loc_name

create table employee (emp_id char(10), emp_name char(20), primary key (emp_id));

create table project (project_name char(20), primary key (project_name));

create table location (loc_name char(15), primary key (loc_name));

create table assigned_to (emp_id char(10), project_name char(20), loc_name char(15) not null, primary key (emp_id, project_name), foreign key (emp_id) references employee on delete cascade on update cascade, foreign key (project_name) references project on delete cascade on update cascade, foreign key (loc_name) references location on delete cascade on update cascade, unique (emp_id, loc_name));


Figure 5.6 (continued)

(c) one-to-many-to-many ternary association
[Diagram: Manager (1) -- Project (*) -- Engineer (*): manages]

Each engineer working on a particular project has exactly one manager, but a project can have many managers and an engineer may have many managers and many projects. A manager may manage several projects.

manages
emp_id  project_name  mgr_id
4106    alpha         27
4200    alpha         27
7033    beta          32
4200    beta          14
4106    gamma         71
7033    delta         55
4106    delta         39
4106    iota          27

Functional dependency
project_name, emp_id → mgr_id

create table project (project_name char(20), primary key (project_name));

create table manager (mgr_id char(10), primary key (mgr_id));

create table engineer (emp_id char(10), primary key (emp_id));

create table manages (project_name char(20), mgr_id char(10) not null, emp_id char(10), primary key (project_name, emp_id), foreign key (project_name) references project on delete cascade on update cascade, foreign key (mgr_id) references manager on delete cascade on update cascade, foreign key (emp_id) references engineer on delete cascade on update cascade);


Figure 5.6 (continued)

(d) many-to-many-to-many ternary association
[Diagram: Employee (*) -- Skill (*) -- Project (*): skill-used]

Employees can use different skills on any one of many projects, and each project has many employees with various skills.

skill_used
emp_id  project_name  skill_type
101     electronics   algebra
101     electronics   calculus
101     mechanics     algebra
101     mechanics     geometry
102     electronics   algebra
102     electronics   set-theory
102     mechanics     geometry
105     mechanics     topology

Functional dependencies
None

create table employee (emp_id char(10), emp_name char(20), primary key (emp_id));

create table skill (skill_type char(15), primary key (skill_type));

create table project (project_name char(20), primary key (project_name));

create table skill_used (emp_id char(10), skill_type char(15), project_name char(20), primary key (emp_id, skill_type, project_name), foreign key (emp_id) references employee on delete cascade on update cascade, foreign key (skill_type) references skill on delete cascade on update cascade, foreign key (project_name) references project on delete cascade on update cascade);


5.1.4 Generalization and Aggregation

The transformation of a generalization abstraction can produce separate SQL tables for the generic or supertype entity and each of the subtypes (Figure 5.7 for the ER model and Figure 5.8 for UML). The table derived from the supertype entity contains the supertype entity key and all common attributes. Each table derived from subtype entities contains the supertype entity key and only the attributes that are specific to that subtype. Update integrity is maintained by requiring all insertions and deletions to occur in both the supertype table and relevant subtype table—that is, the foreign key constraint cascade must be used. If the update is to the primary key of the supertype table, then all subtype tables, as well as the supertype table, must be updated. An update to a nonkey attribute affects either the supertype or one subtype table, but not both.

Figure 5.7 ER model: generalization and aggregation

[Diagram: supertype Individual with subtypes Employee and Customer]
An individual may be either an employee or a customer, or both, or neither.

create table individual (indiv_id char(10), indiv_name char(20), indiv_addr char(20), primary key (indiv_id));

create table employee (emp_id char(10), job_title char(15), primary key (emp_id), foreign key (emp_id) references individual on delete cascade on update cascade);

create table customer (cust_no char(10), cust_credit char(12), primary key (cust_no), foreign key (cust_no) references individual on delete cascade on update cascade);


The transformation rules (and integrity rules) are the same for both the disjoint and overlapping subtype generalizations.

Another approach is to have a single table that includes all attributes from the supertype and subtypes (the whole hierarchy in one table), with nulls used when necessary. A third possibility is one table for each subtype, pushing down the common attributes into the specific subtypes. There are advantages and disadvantages to each of these three approaches. Several software tools now support all three options [Fowler, 2003; Ambler, 2003].

Database practitioners often add a discriminator to the supertype when they implement generalization. The discriminator is an attribute that has a separate value for each subtype and indicates which subtype to use to get further information. This approach works, up to a point. However, there are situations requiring multiple levels of supertypes and subtypes, where more than one discriminator may be required.
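As an illustration (not part of the original figures), a minimal sketch of a discriminator added to the individual supertype of Figure 5.7; the column name indiv_type and its one-character codes are assumptions made here for the example:

create table individual
  (indiv_id char(10),
   indiv_name char(20),
   indiv_addr char(20),
   indiv_type char(1),  -- hypothetical discriminator: 'E' = employee, 'C' = customer, 'B' = both, 'N' = neither
   primary key (indiv_id),
   check (indiv_type in ('E', 'C', 'B', 'N')));

An application can then test indiv_type to decide which subtype table holds the remaining attributes for a given individual.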

The transformation of an aggregation abstraction also produces a separate table for the supertype entity and each subtype entity.

Figure 5.8 UML: generalization and aggregation

[Diagram: supertype Individual with subtypes Employee and Customer]
An individual may be either an employee or a customer, or both, or neither.

create table individual (indiv_id char(10), indiv_name char(20), indiv_addr char(20), primary key (indiv_id));

create table employee (emp_id char(10), job_title char(15), primary key (emp_id), foreign key (emp_id) references individual on delete cascade on update cascade);

create table customer (cust_no char(10), cust_credit char(12), primary key (cust_no), foreign key (cust_no) references individual on delete cascade on update cascade);


However, there are no common attributes and no integrity constraints to maintain. The main function of aggregation is to provide an abstraction to aid the view integration process. In UML, aggregation is a composition relationship, not a type relationship, which corresponds to a weak entity [Muller, 1999].

5.1.5 Multiple Relationships

Multiple relationships among n entities are always considered to be completely independent. One-to-one, one-to-many binary, or binary recursive relationships resulting in tables that are either equivalent or differ only in the addition of a foreign key can simply be merged into a single table containing all the foreign keys. Many-to-many or ternary relationships that result in SQL tables tend to be unique and cannot be merged.

5.1.6 Weak Entities

Weak entities differ from entities only in their need for keys from other entities to establish their uniqueness. Otherwise, they have the same transformation properties as entities, and no special rules are needed. When a weak entity is already derived from two or more entities in the ER diagram, it can be directly transformed into a table without further change.
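For example, here is a sketch (column types assumed for illustration) of a weak entity emp_history, in the spirit of the Emp-history entity used in Chapter 6, whose key borrows emp_id from the owning employee table:

create table emp_history
  (emp_id char(10) not null,   -- key borrowed from the owning employee entity
   start_date date not null,   -- partial key local to the weak entity
   job_title char(15),
   end_date date,
   primary key (emp_id, start_date),
   foreign key (emp_id) references employee
     on delete cascade on update cascade);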

5.2 Transformation Steps

The following list summarizes the basic transformation steps from an ER diagram to SQL tables:

• Transform each entity into a table containing the key and nonkey attributes of the entity

• Transform every many-to-many binary or binary recursive relationship into a table with the keys of the entities and the attributes of the relationship

• Transform every ternary or higher-level n-ary relationship into a table

Now let us study each step in turn.


5.2.1 Entity Transformation

If there is a one-to-many relationship between two entities, add the key of the entity on the "one" side (the parent) into the child table as a foreign key. If there is a one-to-one relationship between one entity and another entity, add the key of one of the entities into the table for the other entity, thus changing it to a foreign key. The addition of a foreign key due to a one-to-one relationship can be made in either direction. One strategy is to maintain the most natural parent-child relationship by putting the parent key into the child table. Another strategy is based on efficiency: add the foreign key to the table with fewer rows.

Every entity in a generalization hierarchy is transformed into a table. Each of these tables contains the key of the supertype entity; in reality, the subtype primary keys are foreign keys as well. The supertype table also contains nonkey values that are common to all the relevant entities; the other tables contain nonkey values specific to each subtype entity.

SQL constructs for these transformations may include constraints for not null, unique, and foreign key. A primary key must be specified for each table, either explicitly from among the keys in the ER diagram or by taking the composite of all attributes as the default key. Note that the primary key designation implies that the attribute is not null and unique. It is important to note, however, that not all DBMSs follow the ANSI standard in this regard—it may be possible in some systems to create a primary key that can be null. We recommend that you specify "not null" explicitly for all key attributes.
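For instance, a minimal sketch of that recommendation applied to the employee table from the earlier figures:

create table employee
  (emp_id char(10) not null,  -- explicit, even though the primary key clause should already imply it
   emp_name char(20),
   primary key (emp_id));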

5.2.2 Many-to-Many Binary Relationship Transformation

In this step, every many-to-many binary relationship is transformed into a table containing the keys of the entities and the attributes of the relationship. The resulting table will show the correspondence between specific instances of one entity and those of another entity. Any attribute of this correspondence, such as the elected office an engineer has in a professional association (Figure 5.1f), is considered intersection data and is added to the table as a nonkey attribute.

SQL constructs for this transformation may include constraints for not null. The unique constraint is not used here because all keys are composites of the participating primary keys of the associated entities in the relationship. The constraints for primary key and foreign key are required, because a table is defined as containing a composite of the primary keys of the associated entities.
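As a sketch of these constructs, the belongs_to table from Figure 5.2(f) could carry the elected office mentioned above as intersection data; the column name elected_office is an assumption introduced here for illustration:

create table belongs_to
  (emp_id char(10),
   assoc_name varchar(256),
   elected_office char(20),            -- hypothetical intersection data describing this membership
   primary key (emp_id, assoc_name),   -- composite of the participating primary keys
   foreign key (emp_id) references engineer
     on delete cascade on update cascade,
   foreign key (assoc_name) references prof_assoc
     on delete cascade on update cascade);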


5.2.3 Ternary Relationship Transformation

In this step, every ternary (or higher n-ary) relationship is transformed into a table. Ternary or higher n-ary relationships are defined as a collection of the n primary keys in the associated entities in that relationship, with possibly some nonkey attributes that are dependent on the key formed by the composite of those n primary keys.

SQL constructs for this transformation must include constraints for not null, since optionality is not allowed. The unique constraint is not used for individual attributes, because all keys are composites of the participating primary keys of the associated entities in the relationship. The constraints for primary key and foreign key are required because a table is defined as a composite of the primary keys of the associated entities. The unique clause must also be used to define alternate keys that often occur with ternary relationships. Note that a table derived from an n-ary relationship has n foreign keys.

5.2.4 Example of ER-to-SQL Transformation

ER diagrams for the company personnel and project database (Chapter 4) can be transformed to SQL tables. A summary of the transformation of entities and relationships to SQL tables is illustrated in the following list.

SQL tables derived directly from entities (see Figure 4.3d):

division    secretary   project
department  engineer    location
employee    technician  prof_assoc
manager     skill       desktop

SQL tables derived from many-to-many binary or many-to-many binary recursive relationships:

• belongs_to

SQL tables transformed from ternary relationships:

• skill_used

• assigned_to


5.3 Summary

Entities, attributes, and relationships in the ER model and classes, attributes, and associations in UML can be transformed directly into SQL (SQL-99) table definitions with some simple rules. Entities are transformed into tables, with all attributes mapped one-to-one to table attributes. Tables representing entities that are the child ("many" side) of a parent-child (one-to-many or one-to-one) relationship must also include, as a foreign key, the primary key of the parent entity. A many-to-many relationship is transformed into a table that contains the primary keys of the associated entities as its composite primary key; the components of that key are also designated as foreign keys in SQL. A ternary or higher-level n-ary relationship is transformed into a table that contains the primary keys of the associated entities; these keys are designated as foreign keys in SQL. A subset of those keys can be designated as the primary key, depending on the functional dependencies associated with the relationship. Rules for generalization require the inheritance of the primary key from the supertype to the subtype entities when transformed into SQL tables. Optionality constraints in the ER or UML diagrams translate into nulls allowed in the relational model when applied to the "one" side of a relationship. In SQL, the lack of an optionality constraint determines the not null designation in the create table definition.

5.4 Literature Summary

Definition of the basic transformations from the ER model to tables is covered in McGee [1974], Wong and Katz [1979], Sakai [1983], Martin [1983], Hawryszkiewyck [1984], Jajodia and Ng [1984], and for UML in Muller [1999].


6 Normalization

This chapter focuses on the fundamentals of normal forms for relational databases and the database design step that normalizes the candidate tables [step II(d) of the database design life cycle]. It also investigates the equivalence between the conceptual data model (e.g., the ER model) and normal forms for tables. As we go through the examples in this chapter, it should become obvious that good, thoughtful design of a conceptual model will result in databases that are either already normalized or can be easily normalized with minor changes. This truth illustrates the beauty of the conceptual modeling approach to database design, in that the experienced relational database designer will develop a natural gravitation toward a normalized model from the beginning.

For most database practitioners, Sections 6.1 through 6.4 cover the critical normalization needed for everyday use, through Boyce-Codd normal form (BCNF). Section 6.5 covers the higher normal forms of mostly theoretical interest; however, we do show the equivalency between higher normal forms and ternary relationships in the ER model and UML for the interested reader.

6.1 Fundamentals of Normalization

Relational database tables, whether they are derived from ER or UML models, sometimes suffer from some rather serious problems in terms of performance, integrity, and maintainability.


For example, when the entire database is defined as a single large table, it can result in a large amount of redundant data and lengthy searches for just a small number of target rows. It can also result in long and expensive updates, and deletions in particular can result in the elimination of useful data as an unwanted side effect.

Such a situation is shown in Figure 6.1, where products, salespersons, customers, and orders are all stored in a single table called Sales. In this table, we see that certain product and customer information is stored redundantly, wasting storage space. Certain queries, such as "Which customers ordered vacuum cleaners last month?" would require a search of the entire table. Also, updates such as changing the address of the customer Dave Bachmann would require changing many rows. Finally, deleting an order by a valued customer such as Qiang Zhu (who bought an expensive computer), if that is his only outstanding order, deletes the only copy of his address and credit rating as a side effect. Such information may be difficult (or sometimes impossible) to recover. These problems also occur for situations in which the database has already been set up as a collection of many tables, but some of the tables are still too large.

If we had a method of breaking up such a large table into smaller tables so that these types of problems would be eliminated, the database would be much more efficient and reliable. Classes of relational database schemes or table definitions, called normal forms, are commonly used to accomplish this goal. The creation of a normal form database table is called normalization.

Figure 6.1 Single table database

Sales
product_name    order_no  cust_name          cust_addr     credit  date     sales_name
vacuum cleaner  1458      Dave Bachmann      Austin        6       1-3-03   Carl Bloch
computer        2730      Qiang Zhu          Ann Arbor     10      4-15-05  Ted Hanss
DVD player      2460      Mike Stolarchuck   Plymouth      8       9-12-04  Dick Phillips
refrigerator    519       Peter Honeyman     Detroit       3       12-5-04  Fred Remley
radio           1986      Charles Antonelli  Chicago       7       5-10-05  R. Metz
CD player       1817      C.V. Ravishankar   Mumbai        8       8-3-02   Paul Basile
vacuum cleaner  1865      Charles Antonelli  Chicago       7       10-1-04  Carl Bloch
vacuum cleaner  1885      Betsy Karmeisool   Detroit       8       4-19-99  Carl Bloch
refrigerator    1943      Dave Bachmann      Austin        6       1-4-04   Dick Phillips
television      2315      Sakti Pramanik     East Lansing  6       3-15-04  Fred Remley


Normalization is accomplished by analyzing the interdependencies among individual attributes associated with those tables and taking projections (subsets of columns) of larger tables to form smaller ones.

Let us first review the basic normal forms, which have been well established in the relational database literature and in practice.

6.1.1 First Normal Form

Definition. A table is in first normal form (1NF) if and only if all columns contain only atomic values, that is, each column can have only one value for each row in the table.

Relational database tables, such as the Sales table illustrated in Figure 6.1, have only atomic values for each row and for each column. Such tables are considered to be in first normal form, the most basic level of normalized tables.

To better understand the definition for 1NF, it helps to know the difference between a domain, an attribute, and a column. A domain is the set of all possible values for a particular type of attribute, but may be used for more than one attribute. For example, the domain of people's names is the underlying set of all possible names that could be used for either customer-name or salesperson-name in the database table in Figure 6.1. Each column in a relational table represents a single attribute, but in some cases more than one column may refer to different attributes from the same domain. When this occurs, the table is still in 1NF because the values in the table are still atomic. In fact, standard SQL assumes only atomic values and a relational table is by default in 1NF. A nice explanation of this is given in Muller [1999].

6.1.2 Superkeys, Candidate Keys, and Primary Keys

A table in 1NF often suffers from data duplication, update performance, and update integrity problems, as noted above. To understand these issues better, however, we must define the concept of a key in the context of normalized tables. A superkey is a set of one or more attributes, which, when taken collectively, allows us to identify uniquely an entity or table. Any subset of the attributes of a superkey that is also a superkey, and not reducible to another superkey, is called a candidate key. A primary key is selected arbitrarily from the set of candidate keys to be used in an index for that table.


As an example, in Figure 6.2 a composite of all the attributes of the table forms a superkey because duplicate rows are not allowed in the relational model. Thus, a trivial superkey is formed from the composite of all attributes in a table. Assuming that each department address (dept_addr) in this table is single valued, we can conclude that the composite of all attributes except dept_addr is also a superkey. Looking at smaller and smaller composites of attributes and making realistic assumptions about which attributes are single valued, we find that the composite (report_no, author_id) uniquely determines all the other attributes in the table and is therefore a superkey. However, neither report_no nor author_id alone can determine a row uniquely, and the composite of these two attributes cannot be reduced and still be a superkey. Thus, the composite (report_no, author_id) becomes a candidate key. Since it is the only candidate key in this table, it also becomes the primary key.

A table can have more than one candidate key. If, for example, in Figure 6.2, we had an additional column for author_ssn, and the composite of report_no and author_ssn uniquely determine all the other attributes of the table, then both (report_no, author_id) and (report_no, author_ssn) would be candidate keys. The primary key would then be an arbitrary choice between these two candidate keys.

Other examples of multiple candidate keys can be seen in Figure 5.5 (see Chapter 5). In Figure 5.5a the table uses_notebook has three candidate keys: (emp_id, project_name), (emp_id, notebook_no), and (project_name, notebook_no); and in Figure 5.5b the table assigned_to has two candidate keys: (emp_id, loc_name) and (emp_id, project_name). Figures 5.5c and 5.5d each have only a single candidate key.

Figure 6.2 Report table

Report
report_no  editor  dept_no  dept_name  dept_addr  author_id  author_name  author_addr
4216       woolf   15       design     argus1     53         mantei       cs-tor
4216       woolf   15       design     argus1     44         bolton       mathrev
4216       woolf   15       design     argus1     71         koenig       mathrev
5789       koenig  27       analysis   argus2     26         fry          folkstone
5789       koenig  27       analysis   argus2     38         umar         prise
5789       koenig  27       analysis   argus2     71         koenig       mathrev


6.1.3 Second Normal Form

To explain the concept of second normal form (2NF) and higher, we introduce the concept of functional dependence, which was briefly described in Chapter 2. The property of one or more attributes that uniquely determine the value of one or more other attributes is called functional dependence. Given a table (R), a set of attributes (B) is functionally dependent on another set of attributes (A) if, at each instant of time, each A value is associated with only one B value. Such a functional dependence is denoted by A -> B. In the preceding example from Figure 6.2, let us assume we are given the following functional dependencies for the table report:

report: report_no -> editor, dept_no
        dept_no -> dept_name, dept_addr
        author_id -> author_name, author_addr

Definition. A table is in second normal form (2NF) if and only if it is in 1NF and every nonkey attribute is fully dependent on the primary key. An attribute is fully dependent on the primary key if it is on the right side of an FD for which the left side is either the primary key itself or something that can be derived from the primary key using the transitivity of FDs.

An example of a transitive FD in report is the following:

report_no -> dept_no
dept_no -> dept_name

Therefore we can derive the FD (report_no -> dept_name), since dept_name is transitively dependent on report_no.

Continuing our example, the composite key in Figure 6.2, (report_no, author_id), is the only candidate key and is therefore the primary key. However, there exists one FD (dept_no -> dept_name, dept_addr) that has no component of the primary key on the left side, and two FDs (report_no -> editor, dept_no and author_id -> author_name, author_addr) that contain one component of the primary key on the left side, but not both components. As such, report does not satisfy the condition for 2NF for any of the FDs.


Consider the disadvantages of 1NF in table report. Report_no, editor, and dept_no are duplicated for each author of the report. Therefore, if the editor of the report changes, for example, several rows must be updated. This is known as the update anomaly, and it represents a potential degradation of performance due to the redundant updating. If a new editor is to be added to the table, it can only be done if the new editor is editing a report: both the report number and editor number must be known to add a row to the table, because you cannot have a primary key with a null value in most relational databases. This is known as the insert anomaly. Finally, if a report is withdrawn, all rows associated with that report must be deleted. This has the side effect of deleting the information that associates an author_id with author_name and author_addr. Deletion side effects of this nature are known as delete anomalies. They represent a potential loss of integrity, because the only way the data can be restored is to find the data somewhere outside the database and insert it back into the database. All three of these anomalies represent problems to database designers, but the delete anomaly is by far the most serious because you might lose data that cannot be recovered.

These disadvantages can be overcome by transforming the 1NF table into two or more 2NF tables by using the projection operator on the subset of the attributes of the 1NF table. In this example we project report over report_no, editor, dept_no, dept_name, and dept_addr to form report1; and project report over author_id, author_name, and author_addr to form report2; and finally project report over report_no and author_id to form report3. The projection of report into three smaller tables has preserved the FDs and the association between report_no and author_id that was important in the original table. Data for the three tables is shown in Figure 6.3. The FDs for these 2NF tables are:

report1: report_no -> editor, dept_no
         dept_no -> dept_name, dept_addr

report2: author_id -> author_name, author_addr

report3: report_no, author_id is a candidate key (no FDs)

We now have three tables that satisfy the conditions for 2NF, and we have eliminated the worst problems of 1NF, especially integrity (the delete anomaly). First, editor, dept_no, dept_name, and dept_addr are no longer duplicated for each author of a report. Second, an editor change results in only an update to one row for report1. And third, the most important, the deletion of the report does not have the side effect of deleting the author information.
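As an illustration (column types are assumptions, following the conventions of Chapter 5), the three projections can be defined and populated in SQL; the insert ... select distinct statements perform the projections from the original 1NF table report:

create table report1
  (report_no integer, editor char(10), dept_no integer,
   dept_name char(15), dept_addr char(20),
   primary key (report_no));

create table report2
  (author_id char(10), author_name char(20), author_addr char(20),
   primary key (author_id));

create table report3
  (report_no integer, author_id char(10),
   primary key (report_no, author_id));

-- populate each table by projecting (and deduplicating) the 1NF table:
insert into report1
  select distinct report_no, editor, dept_no, dept_name, dept_addr from report;
insert into report2
  select distinct author_id, author_name, author_addr from report;
insert into report3
  select distinct report_no, author_id from report;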


Not all performance degradation is eliminated, however; report_no is still duplicated for each author, and deletion of a report requires updates to two tables (report1 and report3) instead of one. However, these are minor problems compared to those in the 1NF table report.

Note that these three report tables in 2NF could have been generated directly from an ER (or UML) diagram that equivalently modeled this situation with entities Author and Report and a many-to-many relationship between them.

6.1.4 Third Normal Form

The 2NF tables we established in the previous section represent a significant improvement over 1NF tables.

Figure 6.3 2NF tables

Report 1
report_no  editor  dept_no  dept_name  dept_addr
4216       woolf   15       design     argus 1
5789       koenig  27       analysis   argus 2

Report 2
author_id  author_name  author_addr
53         mantei       cs-tor
44         bolton       mathrev
71         koenig       mathrev
26         fry          folkstone
38         umar         prise
71         koenig       mathrev

Report 3
report_no  author_id
4216       53
4216       44
4216       71
5789       26
5789       38
5789       71


However, they still suffer from the same types of anomalies as the 1NF tables, although for different reasons associated with transitive dependencies. If a transitive (functional) dependency exists in a table, it means that two separate facts are represented in that table, one fact for each functional dependency involving a different left side. For example, if we delete a report from the database, which involves deleting the appropriate rows from report1 and report3 (see Figure 6.3), we have the side effect of deleting the association between dept_no, dept_name, and dept_addr as well. If we could project table report1 over report_no, editor, and dept_no to form table report11, and project report1 over dept_no, dept_name, and dept_addr to form table report12, we could eliminate this problem. Example tables for report11 and report12 are shown in Figure 6.4.

Definition. A table is in third normal form (3NF) if and only if for every nontrivial functional dependency X -> A, where X and A are either simple or composite attributes, one of two conditions must hold. Either attribute X is a superkey, or attribute A is a member of a candidate key. If attribute A is a member of a candidate key, A is called a prime attribute. Note: a trivial FD is of the form YZ -> Z.

Figure 6.4 3NF tables

Report 11
report_no  editor  dept_no
4216       woolf   15
5789       koenig  27

Report 12
dept_no  dept_name  dept_addr
15       design     argus 1
27       analysis   argus 2

Report 2
author_id  author_name  author_addr
53         mantei       cs-tor
44         bolton       mathrev
71         koenig       mathrev
26         fry          folkstone
38         umar         prise
71         koenig       mathrev

Report 3
report_no  author_id
4216       53
4216       44
4216       71
5789       26
5789       38
5789       71


In the preceding example, after projecting report1 into report11 and report12 to eliminate the transitive dependency report_no -> dept_no -> dept_name, dept_addr, we have the following 3NF tables and their functional dependencies (and example data in Figure 6.4):

report11: report_no -> editor, dept_no
report12: dept_no -> dept_name, dept_addr
report2:  author_id -> author_name, author_addr
report3:  report_no, author_id is a candidate key (no FDs)

6.1.5 Boyce-Codd Normal Form

3NF, which eliminates most of the anomalies known in databases today, is the most common standard for normalization in commercial databases and CASE tools. The few remaining anomalies can be eliminated by the Boyce-Codd normal form (BCNF) and higher normal forms defined here and in Section 6.5. BCNF is considered to be a strong variation of 3NF.

Definition. A table R is in Boyce-Codd normal form (BCNF) if for every nontrivial FD X -> A, X is a superkey.

BCNF is a stronger form of normalization than 3NF because it eliminates the second condition for 3NF, which allowed the right side of the FD to be a prime attribute. Thus, every left side of an FD in a table must be a superkey. Every table that is BCNF is also 3NF, 2NF, and 1NF, by the previous definitions.

The following example shows a 3NF table that is not BCNF. Such tables have delete anomalies similar to those in the lower normal forms.

Assertion 1. For a given team, each employee is directed by only one leader. A team may be directed by more than one leader.

emp_name, team_name -> leader_name

Assertion 2. Each leader directs only one team.

leader_name -> team_name


This table is 3NF with a composite candidate key (emp_name, team_name):

team: emp_name  team_name  leader_name
      Sutton    Hawks      Wei
      Sutton    Condors    Bachmann
      Niven     Hawks      Wei
      Niven     Eagles     Makowski
      Wilson    Eagles     DeSmith

The team table has the following delete anomaly: if Sutton drops out of the Condors team, then we have no record of Bachmann leading the Condors team. As shown by Date [1999], this type of anomaly cannot have a lossless decomposition and preserve all FDs. A lossless decomposition requires that when you decompose the table into two smaller tables by projecting the original table over two overlapping subsets of the scheme, the natural join of those subset tables must result in the original table without any extra unwanted rows. The simplest way to avoid the delete anomaly for this kind of situation is to create a separate table for each of the two assertions. These two tables are partially redundant, enough so to avoid the delete anomaly. This decomposition is lossless (trivially) and preserves functional dependencies, but it also degrades update performance due to redundancy, and necessitates additional storage space. The trade-off is often worth it because the delete anomaly is avoided.
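A sketch of that decomposition, one table per assertion (the second table's name and all column types are assumptions):

create table team
  (emp_name char(20), team_name char(20), leader_name char(20),
   primary key (emp_name, team_name));  -- assertion 1: emp_name, team_name -> leader_name

create table team_leadership
  (leader_name char(20), team_name char(20),
   primary key (leader_name));          -- assertion 2: leader_name -> team_name

Deleting Sutton's row from team no longer loses the fact that Bachmann leads the Condors, because team_leadership still records it.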

6.2 The Design of Normalized Tables: A Simple Example

The example in this section is based on the ER diagram in Figure 6.5 and the FDs given below. In general, FDs can be given explicitly, derived from the ER diagram, or derived from intuition (that is, from experience with the problem domain).

1. emp_id, start_date -> job_title, end_date

2. emp_id -> emp_name, phone_no, office_no, proj_no, proj_name, dept_no

3. phone_no -> office_no



4. proj_no -> proj_name, proj_start_date, proj_end_date

5. dept_no -> dept_name, mgr_id

6. mgr_id -> dept_no

Our objective is to design a relational database schema that is normalized to at least 3NF and, if possible, minimize the number of tables required. Our approach is to apply the definition of third normal form (3NF) in Section 6.1.4 to the FDs given above, and create tables that satisfy the definition.

If we try to put FDs 1 through 6 into a single table with the composite candidate key (and primary key) (emp_id, start_date), we violate the 3NF definition, because FDs 2 through 6 involve left sides of FDs that are not superkeys. Consequently, we need to separate 1 from the rest of the FDs. If we then try to combine 2 through 6, we have many transitivities. Intuitively, we know that 2, 3, 4, and 5 must be separated into different tables because of transitive dependencies. We then must decide whether 5 and 6 can be combined without loss of 3NF; this can be done because mgr_id and dept_no are mutually dependent and both attributes are superkeys in a combined table.

Figure 6.5 ER diagram for employee database

[ER diagram: Employee (emp-id, emp-name, phone-no, office-no); Emp-history (job-title, start-date, end-date); Project (proj-no, proj-name, proj-start-date, proj-end-date); Department (dept-no, dept-name, mgr-id). Relationships: Employee 1--N Emp-history (has); Employee N--1 Department (works-in); Employee 1--1 Department (manages); Employee N--1 Project (works-on).]


Thus, we can define the following tables by appropriate projections from 1 through 6.

emp_hist: emp_id, start_date -> job_title, end_date

employee: emp_id -> emp_name, phone_no, proj_no, dept_no

phone: phone_no -> office_no

project: proj_no -> proj_name, proj_start_date, proj_end_date

department: dept_no -> dept_name, mgr_id
            mgr_id -> dept_no

This solution, which is BCNF as well as 3NF, maintains all the original FDs. It is also a minimum set of normalized tables. In Section 6.4, we will look at a formal method of determining a minimum set that we can apply to much more complex situations.
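As an illustration (not from the original), the five tables written out as SQL, with all column types assumed; in a real DBMS the circular references between employee and department would be resolved by creation order or alter table:

create table emp_hist
  (emp_id char(10), start_date date, job_title char(15), end_date date,
   primary key (emp_id, start_date),
   foreign key (emp_id) references employee);

create table employee
  (emp_id char(10), emp_name char(20), phone_no char(10),
   proj_no integer, dept_no integer,
   primary key (emp_id),
   foreign key (phone_no) references phone,
   foreign key (proj_no) references project,
   foreign key (dept_no) references department);

create table phone
  (phone_no char(10), office_no char(10),
   primary key (phone_no));

create table project
  (proj_no integer, proj_name char(20), proj_start_date date, proj_end_date date,
   primary key (proj_no));

create table department
  (dept_no integer, dept_name char(15), mgr_id char(10) not null unique,
   primary key (dept_no),
   foreign key (mgr_id) references employee);  -- mgr_id -> dept_no is captured by the unique constraint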

Alternative designs may involve splitting tables into partitions for volatile (frequently updated) and passive (rarely updated) data, consolidating tables to get better query performance, or duplicating data in different tables to get better query performance without losing integrity. In summary, the measures we use to assess the trade-offs in our design are:

• Query performance (time)

• Update performance (time)

• Storage performance (space)

• Integrity (avoidance of delete anomalies)

6.3 Normalization of Candidate Tables Derived from ER Diagrams

Normalization of candidate tables [step II(d) in the database life cycle] is accomplished by analyzing the FDs associated with those tables: explicit FDs from the database requirements analysis (Section 6.2), FDs derived from the ER diagram, and FDs derived from intuition.

Primary FDs represent the dependencies among the data elements that are keys of entities, that is, the interentity dependencies. Secondary FDs, on the other hand, represent dependencies among data elements that comprise a single entity, that is, the intraentity dependencies. Typically, primary FDs are derived from the ER diagram, and secondary FDs are obtained explicitly from the requirements analysis.


If the ER constructs do not include nonkey attributes used in secondary FDs, the data requirements specification or data dictionary must be consulted. Table 6.1 shows the types of primary FDs derivable from each type of ER construct.

Each candidate table will typically have several primary and secondary FDs uniquely associated with it that determine the current degree of normalization of the table. Any of the well-known techniques for increasing the degree of normalization can be applied to each table to the desired degree stated in the requirements specification. Integrity is maintained by requiring the normalized table schema to include all data dependencies existing in the candidate table schema.

Any table B that is subsumed by another table A can potentially be eliminated. Table B is subsumed by another table A when all the attributes in B are also contained in A, and all data dependencies in B also occur in A. As a trivial case, any table containing only a composite key and no nonkey attributes is automatically subsumed by any other table containing the same key attributes, because the composite key is the weakest form of data dependency. If, however, tables A and B represent the supertype and subtype cases, respectively, of entities defined by the generalization abstraction, and A subsumes B because B has no additional specific attributes, the designer must collect and analyze additional information to decide whether or not to eliminate B.

A table can also be subsumed by the construction of a join of two other tables (a "join" table).

Table 6.1 Primary FDs Derivable from ER Relationship Constructs

Degree                      Connectivity          Primary FD
Binary or binary recursive  one-to-one            2 ways: key(one side) -> key(one side)
                            one-to-many           key(many side) -> key(one side)
                            many-to-many          none (composite key from both sides)
Ternary                     one-to-one-to-one     3 ways: key(one), key(one) -> key(one)
                            one-to-one-to-many    2 ways: key(one), key(many) -> key(one)
                            one-to-many-to-many   1 way: key(many), key(many) -> key(one)
                            many-to-many-to-many  none (composite key from all 3 sides)
Generalization              none                  none (secondary FD only)


When this occurs, the elimination of a subsumed table may result in the loss of retrieval efficiency, although storage and update costs will tend to be decreased. This trade-off must be further analyzed during physical design with regard to processing requirements to determine whether elimination of the subsumed table is reasonable.

To continue our example company personnel and project database, we want to obtain the primary FDs by applying the rules in Table 6.1 to each relationship in the ER diagram in Figure 4.3. The results are shown in Table 6.2.

Next we want to determine the secondary FDs. Let us assume that the dependencies in Table 6.3 are derived from the requirements specification and intuition.

Normalization of the candidate tables is accomplished next. In Table 6.4 we bring together the primary and secondary FDs that apply to each candidate table. We note that for each table except employee, all attributes are functionally dependent on the primary key (denoted by the left side of the FDs) and are thus BCNF. In the case of table employee, we note that spouse_id determines emp_id and emp_id is the primary key; thus spouse_id can be shown to be a superkey (see Superkey Rule 2 in Section 6.4). Therefore, employee is found to be BCNF.
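In SQL, the spouse_id -> emp_id dependency can be enforced with a uniqueness constraint, which is exactly what makes spouse_id a superkey. A minimal sketch, assuming column types the book does not specify:

    create table employee (
        emp_id    char(10) primary key,
        emp_name  varchar(50),
        emp_addr  varchar(100),
        office_no char(6),
        phone_no  char(12),
        dept_no   char(6),
        spouse_id char(10) unique  -- spouse_id -> emp_id: a second candidate key
    );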

Table 6.2 Primary FDs Derived from the ER Diagram in Figure 4.3

dept_no -> div_no in Department from relationship “contains”

emp_id -> dept_no in Employee from relationship “has”

div_no -> emp_id in Division from relationship “is-headed-by”

dept_no -> emp_id from binary relationship “is-managed-by”

emp_id -> desktop_no from binary relationship “has-allocated”

desktop_no -> emp_id from binary relationship “has-allocated”

emp_id -> spouse_id from binary recursive relationship “is-married-to”

spouse_id -> emp_id from binary recursive relationship “is-married-to”

emp_id, loc_name -> project_name from ternary relationship “assigned-to”


In general, we observe that candidate tables, like the ones shown in Table 6.4, are fairly good indicators of the final schema and normally require very little refinement to get to 3NF or BCNF. This observation is important: good initial conceptual design usually results in tables that are already normalized or are very close to being normalized, and thus the normalization process is usually a simple task.

Table 6.3 Secondary FDs Derived from the Requirements Specification

div_no -> div_name, div_addr from entity Division

dept_no -> dept_name, dept_addr, mgr_id from entity Department

emp_id -> emp_name, emp_addr, office_no, phone_no from entity Employee

skill_type -> skill_descrip from entity Skill

project_name -> start_date, end_date, head_id from entity Project

loc_name -> loc_county, loc_state, zip from entity Location

mgr_id -> mgr_start_date, beeper_phone_no from entity Manager

assoc_name -> assoc_addr, phone_no, start_date from entity Prof-assoc

desktop_no -> computer_type, serial_no from entity Desktop

Table 6.4 Candidate Tables (and FDs) from ER Diagram Transformation

division      div_no -> div_name, div_addr
              div_no -> emp_id
department    dept_no -> dept_name, dept_addr, mgr_id
              dept_no -> div_no
              dept_no -> emp_id
employee      emp_id -> emp_name, emp_addr, office_no, phone_no
              emp_id -> dept_no
              emp_id -> spouse_id
              spouse_id -> emp_id
manager       mgr_id -> mgr_start_date, beeper_phone_no
secretary     none
engineer      emp_id -> desktop_no
technician    none
skill         skill_type -> skill_descrip
project       project_name -> start_date, end_date, head_id
location      loc_name -> loc_county, loc_state, zip
prof_assoc    assoc_name -> assoc_addr, phone_no, start_date
desktop       desktop_no -> computer_type, serial_no
              desktop_no -> emp_id
assigned_to   emp_id, loc_name -> project_name
skill_used    none


6.4 Determining the Minimum Set of 3NF Tables

A minimum set of 3NF tables can be obtained from a given set of FDs by using the well-known synthesis algorithm developed by Bernstein [1976]. This process is particularly useful when you are confronted with a list of hundreds or thousands of FDs that describe the semantics of a database. In practice, the ER modeling process automatically decomposes this problem into smaller subproblems: the attributes and FDs of interest are restricted to those attributes within an entity (and its equivalent table) and any foreign keys that might be imposed upon that table. Thus, the database designer will rarely have to deal with more than ten or twenty attributes at a time, and in fact most entities are initially defined in 3NF already. For those tables that are not yet in 3NF, only minor adjustments will be needed in most cases.

In the following text we briefly describe the synthesis algorithm for those situations where the ER model is not useful for the decomposition. In order to apply the algorithm, we make use of the well-known Armstrong axioms, which define the basic relationships among FDs.

Inference rules (Armstrong axioms)

Reflexivity         If Y is a subset of the attributes of X, then X -> Y (i.e., if X is ABCD and Y is ABC, then X -> Y; trivially, X -> X).
Augmentation        If X -> Y and Z is a subset of table R (i.e., Z is any attribute in R), then XZ -> YZ.
Transitivity        If X -> Y and Y -> Z, then X -> Z.
Pseudotransitivity  If X -> Y and YW -> Z, then XW -> Z. (Transitivity is a special case of pseudotransitivity when W is null.)
Union               If X -> Y and X -> Z, then X -> YZ (or equivalently, X -> Y,Z).
Decomposition       If X -> YZ, then X -> Y and X -> Z.



These axioms can be used to derive two practical rules of thumb for deriving superkeys of tables where at least one superkey is already known.

Superkey Rule 1. Any FD involving all attributes of a table defines a superkey as the left side of the FD.

Given: Any FD containing all attributes in the table R(W,X,Y,Z), i.e., XY -> WZ.

Proof:

1. XY -> WZ as given.

2. XY -> XY by applying the reflexivity axiom.

3. XY -> XYWZ by applying the union axiom.

4. XY uniquely determines every attribute in table R, as shown in 3.

5. XY uniquely defines table R, by the definition of a table as having no duplicate rows.

6. XY is therefore a superkey, by definition.

Superkey Rule 2. Any attribute that functionally determines a superkey of a table is also a superkey for that table.

Given: Attribute A is a superkey for table R(A,B,C,D,E), and E -> A.

Proof:

1. Attribute A uniquely defines each row in table R, by the definition of a superkey.

2. A -> ABCDE by applying the definition of a superkey and a relational table.

3. E -> A as given.

4. E -> ABCDE by applying the transitivity axiom.

5. E is a superkey for table R, by definition.


Before we can describe the synthesis algorithm, we must define some important concepts. Let H be a set of FDs that represents at least part of the known semantics of a database. The closure of H, specified by H+, is the set of all FDs derivable from H using the Armstrong axioms or inference rules. For example, we can apply the transitivity rule to the following FDs in set H:

A -> B, B -> C, A -> C, and C -> D

to derive the FDs A -> D and B -> D. All six FDs constitute the closure H+. A cover of H, called H’, is any set of FDs from which H+ can be derived. Possible covers for this example are:

1. A->B, B->C, C->D, A->C, A->D, B->D (trivial case where H’ and H+ are equal)

2. A->B, B->C, C->D, A->C, A->D

3. A->B, B->C, C->D, A->C (this is the original set H)

4. A->B, B->C, C->D

A nonredundant cover of H is a cover of H that contains no proper subset of FDs that is itself also a cover. The synthesis algorithm requires nonredundant covers.

3NF Synthesis Algorithm

Given a set of FDs, H, we determine a minimum set of tables in 3NF.

H: AB->C      DM->NP
   A->DEFG    D->M
   E->G       L->D
   F->DJ      PQR->ST
   G->DI      PR->S
   D->KL

From this point the process of arriving at the minimum set of 3NF tables consists of five steps:

1. Eliminate extraneous attributes in the left sides of the FDs

2. Search for a nonredundant cover, G, of H



3. Partition G into groups so that all FDs with the same left side are in one group

4. Merge equivalent keys

5. Define the minimum set of normalized tables

Now we discuss each step in turn, in terms of the preceding set of FDs, H.

Step 1. Elimination of Extraneous Attributes

The first task is to get rid of extraneous attributes in the left sides of the FDs.

The following two relationships (rules) among attributes on the left side of an FD provide the means to reduce the left side to fewer attributes.

Reduction Rule 1. XY->Z and X->Z => Y is extraneous on the left side (applying the reflexivity and transitivity axioms).

Reduction Rule 2. XY->Z and X->Y => Y is extraneous; therefore X->Z (applying the pseudotransitivity axiom).

Applying these reduction rules to the set of FDs in H, we get:

DM->NP and D->M => D->NP

PQR->ST and PR->S => PQR->T

Step 2. Search for a Nonredundant Cover

We must eliminate any FD derivable from others in H using the inference rules.

Transitive FDs to be eliminated:

A->E and E->G => eliminate A->G

A->F and F->D => eliminate A->D

Step 3. Partitioning of the Nonredundant Cover

To partition the nonredundant cover into groups so that all FDs with the same left side are in one group, we must separate the nonfully functional dependencies and transitive dependencies into separate tables. At this point we have a feasible solution for 3NF tables, but it is not necessarily the minimum set.

These nonfully functional dependencies must be put into separate tables:

AB->C

A->EF

Groups with the same left side:

G1: AB->C     G6: D->KLMNP
G2: A->EF     G7: L->D
G3: E->G      G8: PQR->T
G4: G->DI     G9: PR->S
G5: F->DJ

Step 4. Merge of Equivalent Keys (Merge of Tables)

In this step we merge groups with left sides that are equivalent (for example, X->Y and Y->X imply that X and Y are equivalent). This step produces a minimum set.

1. Write out the closure of all left side attributes resulting from Step 3, based on transitivities.

2. Using the closures, find tables that are subsets of other groups and try to merge them. Use Superkey Rule 1 and Superkey Rule 2 to establish whether the merge will result in FDs with superkeys on the left side. If not, try using the axioms to modify the FDs to fit the definition of superkeys.

3. After the subsets are exhausted, look for any overlaps among tables and apply Superkey Rules 1 and 2 (and the axioms) again.

In this example, note that G7 (L->D) has a subset of the attributes of G6 (D->KLMNP). Therefore, we merge to a single table, R6, with FDs D->KLMNP and L->D, because it satisfies 3NF: D is a superkey by Superkey Rule 1, and L is a superkey by Superkey Rule 2.



Step 5. Definition of the Minimum Set of Normalized Tables

The minimum set of normalized tables has now been computed. We define them below in terms of the table name, the attributes in the table, the FDs in the table, and the candidate keys for that table:

R1: ABC (AB->C with key AB)     R5: DFJ (F->DJ with key F)
R2: AEF (A->EF with key A)      R6: DKLMNP (D->KLMNP, L->D, with keys D, L)
R3: EG (E->G with key E)        R7: PQRT (PQR->T with key PQR)
R4: DGI (G->DI with key G)      R8: PRS (PR->S with key PR)

Note that this result is not only 3NF, but also BCNF, which is very frequently the case. This fact suggests a practical algorithm for a (near) minimum set of BCNF tables: use Bernstein’s algorithm to attain a minimum set of 3NF tables, then inspect each table for further decomposition (or partial replication, as shown in Section 6.1.5) to BCNF.
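As a concrete illustration, the merged table R6 could be declared in SQL with both of its candidate keys enforced. This is only a sketch; the single-letter column names come from the example, and the types are placeholders:

    create table r6 (
        d int not null,
        k int,
        l int not null,
        m int,
        n int,
        p int,
        primary key (d),   -- enforces D -> KLMNP
        unique (l)         -- makes L a second candidate key, so L -> D holds
    );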

6.5 Fourth and Fifth Normal Forms

Normal forms up to BCNF were defined solely on FDs, and, for most database practitioners, either 3NF or BCNF is a sufficient level of normalization. However, there are in fact two more normal forms that are needed to eliminate the rest of the currently known anomalies. In this section, we will look at different types of constraints on tables: multivalued dependencies and join dependencies. If these constraints do not exist in a table, which is the most common situation, then any table in BCNF is automatically in fourth normal form (4NF), and fifth normal form (5NF) as well. However, when these constraints do exist, there may be further update (especially delete) anomalies that need to be corrected. First, we must define the concept of multivalued dependency.

6.5.1 Multivalued Dependencies

Definition. In a multivalued dependency (MVD), X ->> Y holds on table R with table scheme RS if, whenever a valid instance of table R(X,Y,Z) contains a pair of rows that contain duplicate values of X, then the instance also contains the pair of rows obtained by interchanging the Y values in the original pair. This includes situations where only pairs of rows exist. Note that X and Y may contain either single or composite attributes.

An MVD X ->> Y is trivial if Y is a subset of X, or if X union Y = RS. Finally, an FD implies an MVD, which implies that a single row with a given value of X is also an MVD, albeit a trivial form.

The following examples show where an MVD does and does not exist in a table. In R1, the first four rows satisfy all conditions for the MVDs X->>Y and X->>Z. Note that MVDs appear in pairs because of the cross-product type of relationship between Y and Z=RS-Y as the two right sides of the two MVDs. The fifth and sixth rows of R1 (when the X value is 2) satisfy the row interchange conditions in the above definition. In both rows, the Y value is 2, so the interchanging of Y values is trivial. The seventh row (3,3,3) satisfies the definition trivially.

In table R2, however, the Y values in the fifth and sixth rows are different (1 and 2), and interchanging the 1 and 2 values for Y results in a row (2,2,2) that does not appear in the table. Thus, in R2 there is no MVD between X and Y or between X and Z, even though the first four rows satisfy the MVD definition. Note that for the MVD to exist, all rows must satisfy the criterion for an MVD.

In table R3, the first three rows do not satisfy the criterion for an MVD, since changing Y from 1 to 2 in the second row results in a row that does not appear in the table. Similarly, changing Z from 1 to 2 in the third row results in a nonappearing row. Thus, R3 does not have any MVDs between X and Y or between X and Z.

R1: X  Y  Z     R2: X  Y  Z     R3: X  Y  Z
    1  1  1         1  1  1         1  1  1
    1  1  2         1  1  2         1  1  2
    1  2  1         1  2  1         1  2  1
    1  2  2         1  2  2         2  2  1
    2  2  1         2  2  1         2  2  2
    2  2  2         2  1  2
    3  3  3


By the same argument, in table R1 we have the MVDs Y ->> X and Y ->> Z, but none with Z on the left side. Tables R2 and R3 have no MVDs at all.

The following inference rules for MVDs are somewhat analogous to the inference rules for functional dependencies given in Section 6.4 [Beeri, Fagin, and Howard, 1977]. They are quite useful in the analysis and decomposition of tables into 4NF.

Multivalued Dependency Inference Rules

Reflexivity         X ->> X
Augmentation        If X ->> Y, then XZ ->> Y.
Transitivity        If X ->> Y and Y ->> Z, then X ->> (Z - Y).
Pseudotransitivity  If X ->> Y and YW ->> Z, then XW ->> (Z - YW). (Transitivity is a special case of pseudotransitivity when W is null.)
Union               If X ->> Y and X ->> Z, then X ->> YZ.
Decomposition       If X ->> Y and X ->> Z, then X ->> Y intersect Z and X ->> (Z - Y).
Complement          If X ->> Y and Z = R - X - Y, then X ->> Z.
FD Implies MVD      If X -> Y, then X ->> Y.
FD, MVD Mix         If X ->> Z and Y -> Z’ (where Z’ is contained in Z, and Y and Z are disjoint), then X -> Z’.

6.5.2 Fourth Normal Form

The goal of 4NF is to eliminate nontrivial MVDs from a table by projecting them onto separate smaller tables, and thus to eliminate the update anomalies associated with the MVDs. This type of normal form is reasonably easy to attain if you know where the MVDs are. In general, MVDs must be defined from the semantics of the database; they cannot be determined from just looking at the data. The current set of data can only verify whether your assumption about an MVD is currently true or not, but this may change each time the data is updated.
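That verification can be phrased as a lossless-join test in SQL. A minimal sketch for checking an assumed MVD X ->> Y on a table r(x, y, z), where r and its columns are hypothetical names:

    -- Join the two projections of r and subtract the original rows;
    -- an empty result means the MVD currently holds in the data.
    select p1.x, p1.y, p2.z
    from (select distinct x, y from r) p1
    join (select distinct x, z from r) p2 on p1.x = p2.x
    except
    select x, y, z from r;

A nonempty result disproves the MVD; an empty result only confirms it for the current rows, which, as noted above, may change with the next update.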


Definition. A table R is in fourth normal form (4NF) if and only if it is in BCNF and, whenever there exists an MVD in R (say X ->> Y), at least one of the following holds: the MVD is trivial, or X is a superkey for R.

Applying this definition to the three tables in the example in the previous section, we see that R1 is not in 4NF because at least one nontrivial MVD exists and no single column is a superkey. In tables R2 and R3, however, there are no MVDs. Thus, these two tables are at least 4NF.

As an example of the transformation of a table that is not in 4NF to two tables that are in 4NF, we observe the ternary relationship skill-required, shown in Figure 6.6. The relationship skill-required is defined as follows: “An employee must have all the required skills needed for a project to work on that project.” For example, in Table 6.5 the project with proj_no = 3 requires skill types A and B by all employees (see employees 101 and 102). The table skill_required has no FDs, but it does have several nontrivial MVDs, and is therefore only in BCNF. In such a case it can have a lossless decomposition into two many-to-many binary relationships between the entities Employee and Project, and Project and Skill. Each of these two new relationships represents a table in 4NF. It can also have a lossless decomposition resulting in a binary many-to-many relationship between the entities Employee and Skill, and Project and Skill.

Figure 6.6 Ternary relationship with multiple interpretations
[Figure: a ternary relationship among Employee, Skill, and Project, with connectivity N on all three sides, carrying three interpretations: (1) skill-required, (2) skill-in-common, (3) skill-used.]

A two-way lossless decomposition occurs when skill_required is projected over (emp_id, proj_no) to form skill_req1 and projected over (proj_no, skill_type) to form skill_req3. Projection over (emp_id, proj_no) to form skill_req1 and over (emp_id, skill_type) to form skill_req2, however, is not lossless. A three-way lossless decomposition occurs when skill_required is projected over (emp_id, proj_no), (emp_id, skill_type), and (proj_no, skill_type).

Table 6.5 The Table skill_required and Its Three Projections

skill_required:  emp_id  proj_no  skill_type      MVDs (nontrivial):
                 101     3        A               proj_no ->> skill_type
                 101     3        B               proj_no ->> emp_id
                 101     4        A
                 101     4        C
                 102     3        A
                 102     3        B
                 103     5        D

skill_req1:          skill_req2:           skill_req3:
emp_id  proj_no      emp_id  skill_type    proj_no  skill_type
101     3            101     A             3        A
101     4            101     B             3        B
102     3            101     C             4        A
103     5            102     A             4        C
                     102     B             5        D
                     103     D

Tables in 4NF avoid certain update anomalies (or inefficiencies). For instance, a delete anomaly exists when two independent facts get tied together unnaturally so that there may be bad side effects of certain deletes. For example, in skill_required, the last row of a skill_type may be lost if an employee is temporarily not working on any projects. An update inefficiency may occur when adding a new project in skill_required, which requires insertions for many rows to include all the required skills for that new project. Likewise, loss of a project requires many deletions. These inefficiencies are avoided when skill_required is decomposed into skill_req1 and skill_req3. In general (but not always), decomposition of a table into 4NF tables results in less data redundancy.
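The decomposition itself is just two projections, which can be sketched in SQL (create table ... as select is standard but varies slightly by DBMS; skill_required is assumed to exist with the columns of Table 6.5):

    create table skill_req1 as
        select distinct emp_id, proj_no from skill_required;

    create table skill_req3 as
        select distinct proj_no, skill_type from skill_required;

    -- Lossless check: joining the projections back on proj_no
    -- reproduces skill_required with no spurious rows.
    select r1.emp_id, r1.proj_no, r3.skill_type
    from skill_req1 r1
    join skill_req3 r3 on r1.proj_no = r3.proj_no;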

6.5.3 Decomposing Tables to 4NF

Algorithms to decompose tables into 4NF are difficult to develop. Let’s look at some straightforward approaches to 4NF from BCNF and lower normal forms. First, if a table is BCNF, it either has no FDs, or each FD is characterized by its left side being a superkey. Thus, if the only MVDs in this table are derived from its FDs, they have only superkeys as their left sides, and the table is 4NF by definition. If, however, there are other nontrivial MVDs whose left sides are not superkeys, the table is only in BCNF and must be decomposed to achieve higher normalization.

The basic decomposition process from a BCNF table is defined by selecting the most important MVD (or, if that is not possible, by selecting one arbitrarily), defining its complement MVD, and decomposing the table into two tables containing the attributes on the left and right sides of that MVD and its complement. This type of decomposition is lossless because each new table is based on the same attribute, which is the left side of both MVDs. The same MVDs in these new tables are now trivial because they contain every attribute in the table. However, other MVDs may still be present, and more decompositions by MVDs and their complements may be necessary. This process of arbitrary selection of MVDs for decomposition is continued until only trivial MVDs exist, leaving the final tables in 4NF.

As an example, let R(A,B,C,D,E,F) be a table with no FDs and with MVDs A ->> B and CD ->> EF. The first decomposition of R is into two tables R1(A,B) and R2(A,C,D,E,F) by applying the MVD A ->> B and its complement A ->> CDEF. Table R1 is now 4NF, because A ->> B is trivial and is the only MVD in the table. Table R2, however, is still only BCNF, because of the nontrivial MVD CD ->> EF. We then decompose R2 into R21(C,D,E,F) and R22(C,D,A) by applying the MVD CD ->> EF and its complement CD ->> A. Both R21 and R22 are now 4NF. If we had applied the MVD complement rule in the opposite order, using CD ->> EF and its complement CD ->> AB first, the same three 4NF tables would result from this method. However, this does not occur in all cases; it only occurs in those tables where the MVDs have no intersecting attributes.

This method, in general, has the unfortunate side effect of potentially losing some or all of the FDs and MVDs. Therefore, any decision to transform tables from BCNF to 4NF must take into account the trade-off between normalization and the elimination of delete anomalies, and the preservation of FDs and possibly MVDs. It should also be noted that this approach derives a feasible, but not necessarily a minimum, set of 4NF tables.

A second approach to decomposing BCNF tables is to ignore the MVDs completely and split each BCNF table into a set of smaller tables, with the candidate key of each BCNF table being the candidate key of a new table and the nonkey attributes distributed among the new tables in some semantically meaningful way. This form of decomposing by candidate key (that is, superkey) is lossless because the candidate keys uniquely join; it usually results in the simplest form of 5NF tables, those with a candidate key and one nonkey attribute, and no MVDs. However, if one or more MVDs still exist, further decomposition must be done with the MVD/MVD-complement approach given above. The decomposition by candidate keys preserves FDs, but the MVD/MVD-complement approach does not preserve either FDs or MVDs.

Tables that are not yet in BCNF can also be directly decomposed into 4NF using the MVD/MVD-complement approach. Such tables can often be decomposed into smaller minimum sets than those derived from transforming into BCNF first and then 4NF, but with a greater cost of lost FDs. In most database design situations, it is preferable to develop BCNF tables first, then evaluate the need to normalize further while preserving the FDs.

6.5.4 Fifth Normal Form

Definition. A table R is in fifth normal form (5NF) or project-join normal form (PJ/NF) if and only if every join dependency in R is implied by the keys of R.

As we recall, a lossless decomposition of a table implies that it can be decomposed by two or more projections, followed by a natural join of those projections (in any order) that results in the original table, without any spurious or missing rows. The general lossless decomposition constraint, involving any number of projections, is also known as a join dependency (JD). A join dependency is illustrated by the following example: in a table R with n arbitrary subsets of the set of attributes of R, R satisfies a join dependency over these n subsets if and only if R is equal to the natural join of its projections on them. A JD is trivial if one of the subsets is R itself.


5NF or PJ/NF requires satisfaction of the membership algorithm [Fagin, 1979], which determines whether a JD is a member of the set of logical consequences of (can be derived from) the set of key dependencies known for this table. In effect, for any 5NF table, every dependency (FD, MVD, JD) is determined by the keys. As a practical matter, we note that because JDs are very difficult to determine in large databases with many attributes, 5NF tables are not easily derivable, and logical database design typically produces BCNF tables.

We should also note that by the preceding definitions, just because a table is decomposable does not necessarily mean it is not 5NF. For example, consider a simple table with four attributes (A,B,C,D), one FD (A->BCD), and no MVDs or JDs not implied by this FD. It could be decomposed into three tables, A->B, A->C, and A->D, all based on the same superkey A; however, it is already in 5NF without the decomposition. Thus, the decomposition is not required for normalization. On the other hand, decomposition can be a useful tool in some instances for performance improvement.

The following example demonstrates that a table representing a ternary relationship may not have any two-way lossless decompositions; however, it may have a three-way lossless decomposition, which is equivalent to three binary relationships, based on the three possible projections of this table. This situation occurs in the relationship skill-in-common (Figure 6.6), which is defined as “The employee must apply the intersection of his or her available skills with the skills needed to work on certain projects.” In this example, skill-in-common is less restrictive than skill-required because it allows an employee to work on a project even if he or she does not have all the skills required for that project.

As Table 6.6 shows, the three projections of skill_in_common result in a three-way lossless decomposition. There are no two-way lossless decompositions and no MVDs; thus, the table skill_in_common is in 4NF.

Table 6.6 The Table skill_in_common and Its Three Projections

skill_in_common:  emp_id  proj_no  skill_type
                  101     3        A
                  101     3        B
                  101     4        A
                  101     4        B
                  102     3        A
                  102     3        B
                  103     3        A
                  103     4        A
                  103     5        A
                  103     5        C

skill_in_com1:       skill_in_com2:          skill_in_com3:
emp_id  proj_no      emp_id  skill_type      proj_no  skill_type
101     3            101     A               3        A
101     4            101     B               3        B
102     3            102     A               4        A
103     3            102     B               4        B
103     4            103     A               5        A
103     5            103     C               5        C

The ternary relationship in Figure 6.6 can be interpreted yet another way. The meaning of the relationship skill-used is “We can selectively record different skills that each employee applies to working on individual projects.” It is equivalent to a table in 5NF that cannot be decomposed into either two or three binary tables. Note by studying Table 6.7 that the associated table, skill_used, has no MVDs or JDs.

Table 6.7 The Table skill_used, Its Three Projections, and Natural Joins of Its Projections

skill_used:  emp_id  proj_no  skill_type
             101     3        A
             101     3        B
             101     4        A
             101     4        C
             102     3        A
             102     3        B
             102     4        A
             102     4        B


Three projections on skill_used result in:

skill_used1:         skill_used2:            skill_used3:
emp_id  proj_no      proj_no  skill_type     emp_id  skill_type
101     3            3        A              101     A
101     4            3        B              101     B
102     3            4        A              101     C
102     4            4        B              102     A
                     4        C              102     B

Join skill_used1 with skill_used2 to form skill_used_12, and then join skill_used_12 with skill_used3 to form skill_used_123:

skill_used_12:                  skill_used_123:
emp_id  proj_no  skill_type     emp_id  proj_no  skill_type
101     3        A              101     3        A
101     3        B              101     3        B
101     4        A              101     4        A
101     4        B              101     4        B   (spurious)
101     4        C              101     4        C
102     3        A              102     3        A
102     3        B              102     3        B
102     4        A              102     4        A
102     4        B              102     4        B
102     4        C
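The same test can be run in SQL. A sketch against the data of Table 6.7 (standard SQL; the table and column names follow the text):

    -- Natural join of the three projections, minus the original rows;
    -- the single surviving row (101, 4, B) is the spurious tuple, so
    -- skill_used has no such three-way lossless decomposition.
    select u12.emp_id, u12.proj_no, u12.skill_type
    from (select distinct s1.emp_id, s1.proj_no, s2.skill_type
          from (select distinct emp_id, proj_no from skill_used) s1
          join (select distinct proj_no, skill_type from skill_used) s2
            on s1.proj_no = s2.proj_no) u12
    join (select distinct emp_id, skill_type from skill_used) s3
      on u12.emp_id = s3.emp_id and u12.skill_type = s3.skill_type
    except
    select emp_id, proj_no, skill_type from skill_used;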

A table may have constraints that are FDs, MVDs, and JDs. An MVD is a special case of a JD. To determine the level of normalization of the table, analyze the FDs first to determine normalization through BCNF; then analyze the MVDs to determine which BCNF tables are also 4NF; then, finally, analyze the JDs to determine which 4NF tables are also 5NF.


A many-to-many-to-many ternary relationship is:

• BCNF if it can be replaced by two binary relationships

• 4NF if it can only be replaced by three binary relationships

• 5NF if it cannot be replaced in any way (and thus is a true ternary relationship)

We observe the equivalence between certain ternary relationships and the higher normal form tables transformed from those relationships. Ternary relationships that have at least one “one” entity cannot be decomposed (or broken down) into binary relationships, because that would destroy the one or more FDs required in the definition, as shown previously. A ternary relationship with all “many” entities, however, has no FDs, but in some cases may have MVDs, and thus have a lossless decomposition into equivalent binary relationships.

In summary, the three common cases that illustrate the correspondence between a lossless decomposition in a many-to-many-to-many ternary relationship table and higher normal forms in the relational model are shown in Table 6.8.

Table 6.8 Summary of Higher Normal Forms

Table Name       Normal Form  Two-way lossless  Three-way lossless  Nontrivial
                              decomp/join?      decomp/join?        MVDs
skill_required   BCNF         yes               yes                 2
skill_in_common  4NF          no                yes                 0
skill_used       5NF          no                no                  0

6.6 Summary

In this chapter, we defined the constraints imposed on tables, most commonly the functional dependencies (FDs). Based on these constraints, practical normal forms for database tables are defined: 1NF, 2NF, 3NF, and BCNF. All are based on the types of FDs present. A practical algorithm for finding the minimum set of 3NF tables was also given.


The following statements summarize the functional equivalence between the ER model and normalized tables:

1. Within an entity. The level of normalization is totally dependent upon the interrelationships among the key and nonkey attributes. It could be any form from unnormalized to BCNF or higher.

2. Binary (or binary recursive) one-to-one or one-to-many relationship. Within the “child” entity, the foreign key (a replication of the primary key of the “parent”) is functionally dependent upon the child’s primary key. This is at least BCNF, assuming that the entity by itself, without the foreign key, is already BCNF.

3. Binary (or binary recursive) many-to-many relationship. The intersection table has a composite key and possibly some nonkey attributes functionally dependent upon it. This is at least BCNF.

4. Ternary relationship.

   a. one-to-one-to-one => three overlapping composite keys, at least BCNF
   b. one-to-one-to-many => two overlapping composite keys, at least BCNF
   c. one-to-many-to-many => one composite key, at least BCNF
   d. many-to-many-to-many => one composite key with three attributes, at least BCNF; in some cases it can also be 4NF, or even 5NF

In summary, we observed that a good, methodical conceptual design procedure often results in database tables that are either normalized (BCNF) already, or can be normalized with very minor changes.

6.7 Literature Summary

Good summaries of normal forms can be found in Date [1999], Kent [1983], Dutka and Hanson [1989], and Smith [1985]. Algorithms for normal form decomposition and synthesis techniques are given in Bernstein [1976], Fagin [1977], and Maier [1983]. The earliest work in normal forms was done by Codd [1970, 1974].


7 An Example of Logical Database Design

The following example illustrates how to proceed through the requirements analysis and logical design steps of the database life cycle, in a practical way, for a relational database.

7.1 Requirements Specification

The management of a large retail store would like a database to keep track of sales activities. The requirements analysis for this database led to the six entities and their unique identifiers shown in Table 7.1.

The following assertions describe the data relationships:

• Each customer has one job title, but different customers may have the same job title.

• Each customer may place many orders, but only one customer may place a particular order.

• Each department has many salespeople, but each salesperson must work in only one department.

• Each department has many items for sale, but each item is sold in only one department. (“Item” means item type, like IBM PC).


• For each order, items ordered in different departments must involve different salespeople, but all items ordered within one department must be handled by exactly one salesperson. In other words, for each order, each item has exactly one salesperson; and for each order, each department has exactly one salesperson.

For physical design (e.g., access methods, etc.) it is necessary to determine what kind of processing needs to be done on the data; that is, what are the queries and updates needed to satisfy the user requirements, and what are their frequencies? In addition, the requirements analysis should determine whether there will be substantial database growth (i.e., volumetrics); in what time frame that growth will take place; and whether the frequency and type of queries and updates will change, as well. Decay as well as growth should be estimated, as each will have a significant effect on the later stages of database design.

7.1.1 Design Problems

1. Using the information given and, in particular, the five assertions, derive a conceptual data model and a set of functional dependencies (FDs) that represent all the known data relationships.

2. Transform the conceptual data model into a set of candidate SQL tables. List the tables, their primary keys, and other attributes.

3. Find the minimum set of normalized (BCNF) tables that are functionally equivalent to the candidate tables.

Table 7.1 Requirements Analysis Results

Entity       Entity key  Key length (max)  Number of
                         in characters     occurrences
Customer     cust-no     6                 80,000
Job          job-no      24                80
Order        order-no    9                 200,000
Salesperson  sales-id    20                150
Department   dept-no     2                 10
Item         item-no     6                 5,000


7.2 Logical Design

Our first step is to develop a conceptual data model diagram and a set of FDs to correspond to each of the assertions given. Figure 7.1 presents the diagram for the ER model, and Figure 7.2 shows the equivalent diagram for UML. Normally, the conceptual data model is developed without knowing all the FDs, but in this example the nonkey attributes are omitted so that the entire database can be represented with only a few statements and FDs. The result of this analysis, relative to each of the assertions given, follows in Table 7.2.

The candidate tables required to represent the semantics of this problem can be derived easily from the constructs for entities and relationships. Primary keys and foreign keys are explicitly defined.

Figure 7.1 Conceptual data model diagram for the ER model
[Figure: entities Customer, Job, Order, Salesperson, Department, and Item, connected by the relationships places, has, hires, contains, order-dept-sales, and order-item-sales, with the 1 and N connectivities given in the assertions.]


Figure 7.2 Conceptual data model diagram for UML
[Figure: the UML equivalent of Figure 7.1, with the same classes and associations and multiplicities 1 and *.]

Table 7.2 Results of the Analysis of the Conceptual Data Model

ER Construct                                     FDs
Customer(many): Job(one)                         cust-no -> job-title
Order(many): Customer(one)                       order-no -> cust-no
Salesperson(many): Department(one)               sales-id -> dept-no
Item(many): Department(one)                      item-no -> dept-no
Order(many): Item(many): Salesperson(one)        order-no, item-no -> sales-id
Order(many): Department(many): Salesperson(one)  order-no, dept-no -> sales-id


create table customer (
    cust_no char(6),
    job_title varchar(256),
    primary key (cust_no),
    foreign key (job_title) references job
        on delete set null on update cascade);

create table job (
    job_no char(6),
    job_title varchar(256),
    primary key (job_no));

create table order (
    order_no char(9),
    cust_no char(6) not null,
    primary key (order_no),
    foreign key (cust_no) references customer
        on delete set null on update cascade);

create table salesperson (
    sales_id char(10),
    sales_name varchar(256),
    dept_no char(2),
    primary key (sales_id),
    foreign key (dept_no) references department
        on delete set null on update cascade);

create table department (
    dept_no char(2),
    dept_name varchar(256),
    manager_name varchar(256),
    primary key (dept_no));

create table item (
    item_no char(6),
    dept_no char(2),
    primary key (item_no),
    foreign key (dept_no) references department
        on delete set null on update cascade);

create table order_item_sales (
    order_no char(9),
    item_no char(6),
    sales_id char(10) not null,
    primary key (order_no, item_no),
    foreign key (order_no) references order
        on delete cascade on update cascade,
    foreign key (item_no) references item
        on delete cascade on update cascade,
    foreign key (sales_id) references salesperson
        on delete cascade on update cascade);

create table order_dept_sales (
    order_no char(9),
    dept_no char(2),
    sales_id char(10) not null,
    primary key (order_no, dept_no),
    foreign key (order_no) references order
        on delete cascade on update cascade,
    foreign key (dept_no) references department
        on delete cascade on update cascade,
    foreign key (sales_id) references salesperson
        on delete cascade on update cascade);

Note that it is often better to put foreign key definitions in separate (alter) statements. This prevents the possibility of getting circular definitions with very large schemas.
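For example, the order table’s foreign key could be deferred to an alter statement (a sketch; the constraint name fk_order_customer is ours, and note that order is a reserved word in SQL, so a real schema would quote or rename it):

    create table order (
        order_no char(9),
        cust_no char(6) not null,
        primary key (order_no));

    alter table order
        add constraint fk_order_customer
        foreign key (cust_no) references customer
        on delete set null on update cascade;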

This process of decomposition and reduction of tables moves us closer to a minimum set of normalized (BCNF) tables, as shown in Table 7.3.

Table 7.3 Decomposition and Reduction of Tables

Table             Primary key         Likely non-keys
customer          cust_no             job_title, cust_name, cust_address
order             order_no            cust_no, item_no, date_of_purchase, price
salesperson       sales_id            dept_no, sales_name, phone_no
item              item_no             dept_no, color, model_no
order_item_sales  order_no, item_no   sales_id
order_dept_sales  order_no, dept_no   sales_id

The reductions shown in this section have decreased storage space and update cost, and have maintained the normalization of BCNF (and thus 3NF). On the other hand, we have potentially higher retrieval cost (consider, for example, the transaction “list all job_titles”), and we have increased the potential for loss of integrity because we have eliminated simple tables with only key attributes. Resolution of these trade-offs depends on your priorities for your database.

The details of indexing will be covered in the companion book, Physical Database Design. However, during the logical design phase of defining SQL tables, it makes sense to start considering where to create indexes. At a minimum, all primary keys and all foreign keys should be indexed. Indexes are relatively easy to implement and store, and make a significant difference in reducing the access time to stored data.
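As a sketch (the index names are ours; primary keys typically get indexes automatically through their primary key constraints, so only the foreign keys are shown):

    create index ix_order_cust on order (cust_no);
    create index ix_salesperson_dept on salesperson (dept_no);
    create index ix_item_dept on item (dept_no);
    create index ix_ois_sales on order_item_sales (sales_id);
    create index ix_ods_sales on order_dept_sales (sales_id);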

7.3 Summary

In this chapter, we developed a global conceptual schema and a set of SQL tables for a relational database, given the requirements specification for a retail store database. The example illustrates the database life cycle steps of conceptual data modeling, global schema design, transformation to SQL tables, and normalization of those tables. It summarizes the techniques presented in Chapters 1 through 6.



8 Business Intelligence

Business intelligence has become a buzzword in recent years. The database tools found under the heading of business intelligence include data warehousing, online analytical processing (OLAP), and data mining. The functionalities of these tools are complementary and interrelated. Data warehousing provides for the efficient storage, maintenance, and retrieval of historical data. OLAP is a service that provides quick answers to ad hoc queries against the data warehouse. Data mining algorithms find patterns in the data and report models back to the user. All three tools are related to the way data in a data warehouse are logically organized, and performance is highly sensitive to the database design techniques used [Barquin and Edelstein, 1997]. The encompassing goal for business intelligence technologies is to provide useful information for decision support.

Each of the major DBMS vendors is marketing the tools for data warehousing, OLAP, and data mining as business intelligence. This chapter covers each of these technologies in turn. We take a close look at the requirements for a data warehouse; its basic components and principles of operation; the critical issues in its design; and the important logical database design elements in its environment. We then investigate the basic elements of OLAP and data mining as special query techniques applied to data warehousing. We cover data warehousing in Section 8.1, OLAP in Section 8.2, and data mining in Section 8.3.


8.1 Data Warehousing

A data warehouse is a large repository of historical data that can be integrated for decision support. The use of a data warehouse is markedly different from the use of operational systems. Operational systems contain the data required for the day-to-day operations of an organization. This operational data tends to change quickly and constantly. The table sizes in operational systems are kept manageably small by periodically purging old data. The data warehouse, by contrast, periodically receives historical data in batches, and grows over time. The vast size of data warehouses can run to hundreds of gigabytes, or even terabytes. The problem that drives data warehouse design is the need for quick results to queries posed against huge amounts of data. The contrasting aspects of data warehouses and operational systems result in a distinctive design approach for data warehousing.

8.1.1 Overview of Data Warehousing

A data warehouse contains a collection of tools for decision support associated with very large historical databases, which enables the end user to make quick and sound decisions. Data warehousing grew out of the technology for decision support systems (DSS) and executive information systems (EIS). DSSs are used to analyze data from commonly available databases with multiple sources, and to create reports. The report data is not time critical in the sense that a real-time system is, but it must be timely for decision making. EISs are like DSSs, but more powerful, easier to use, and more business specific. EISs were designed to provide an alternative to the classical online transaction processing (OLTP) systems common to most commercially available database systems. OLTP systems are often used to create common applications, including those with mission-critical deadlines or response times. Table 8.1 summarizes the basic differences between OLTP and data warehouse systems.

Table 8.1 Comparison between OLTP and Data Warehouse Databases

OLTP                                   Data Warehouse
Transaction oriented                   Business process oriented
Thousands of users                     Few users (typically under 100)
Generally small (MB up to several GB)  Large (from hundreds of GB to several TB)
Current data                           Historical data
Normalized data                        Denormalized data
(many tables, few columns per table)   (few tables, many columns per table)
Continuous updates                     Batch updates*
Simple to complex queries              Usually very complex queries

* There is currently a push in the industry towards “active warehousing,” in which the warehouse receives data in continuous updates. See Section 8.2.5 for further discussion.

The basic architecture for a data warehouse environment is shown in Figure 8.1. The diagram shows that the data warehouse is stocked by a variety of source databases from possibly different geographical locations. Each source database serves its own applications, and the data warehouse serves a DSS/EIS with its informational requests. Each feeder system database must be reconciled with the data warehouse data model; this is accomplished during the process of extracting the required data from the feeder database system, transforming the data from the feeder system to the data warehouse, and loading the data into the data warehouse [Cataldo, 1997].

Figure 8.1 Basic data warehouse architecture
[Figure: feeder databases DB1, DB2, and DB3, each serving its own operational applications, are extracted into a staging area, transformed, and loaded into the data warehouse, which in turn serves report generators, ad hoc query tools, OLAP, and data mining.]

Core Requirements for Data Warehousing

Let us now take a look at the core requirements and principles that guide the design of data warehouses (DWs) [Simon, 1995; Barquin and Edelstein, 1997; Chaudhuri and Dayal, 1997; Gray and Watson, 1998]:

1. DWs are organized around subject areas. Subject areas are analogous to the concept of functional areas, such as sales, project management, or employees, as discussed in the context of ER diagram clustering in Section 4.5. Each subject area has its own conceptual schema and can be represented using one or more entities in the ER data model or by one or more object classes in the object-oriented data model. Subject areas are typically independent of individual transactions involving data creation or manipulation. Metadata repositories are needed to describe source databases, DW objects, and ways of transforming data from the sources to the DW.

2. DWs should have some integration capability. A common data representation should be designed so that all the different individual representations can be mapped to it. This is particularly useful if the warehouse is implemented as a multidatabase or federated database.

3. The data is considered to be nonvolatile and should be mass loaded. Data extraction from current databases to the DW requires that a decision should be made whether to extract the data using standard relational database (RDB) techniques at the row or column level or specialized techniques for mass extraction. Data cleaning tools are required to maintain data quality, for example, to detect missing data, inconsistent data, homonyms, synonyms, and data with different units. Data migration, data scrubbing, and data auditing tools handle specialized problems in data cleaning and transformation. Such tools are similar to those used for conventional relational database schema (view) integration. Load utilities take cleaned data and load it into the DW, using batch processing techniques. Refresh techniques propagate updates on the source data to base data and derived data in the DW. The decision of when and how to refresh is made by the DW administrator and depends on user needs (e.g., OLAP needs) and existing traffic to the DW.

4. Data tends to exist at multiple levels of granularity. Most important, the data tends to be of a historical nature, with potentially high time variance. In general, however, granularity can vary according to many different dimensions, not only by time frame but also by geographic region, type of product manufactured or sold, type of store, and so on. The sheer size of the databases is a major problem in the design and implementation of DWs, especially for certain queries, updates, and sequential backups. This necessitates a critical decision between using a relational database (RDB) or a multidimensional database (MDD) for the implementation of a DW.

5. The DW should be flexible enough to meet changing requirements rapidly. Data definitions (schemas) must be broad enough to anticipate the addition of new types of data. For rapidly changing data retrieval requirements, the types of data and levels of granularity actually implemented must be chosen carefully.

6. The DW should have a capability for rewriting history, that is, allowing for “what-if” analysis. The DW should allow the administrator to update historical data temporarily for the purpose of “what-if” analysis. Once the analysis is completed, the data must be correctly rolled back. This condition assumes that the data are at the proper level of granularity in the first place.

7. A usable DW user interface should be selected. The leading choices today are SQL, multidimensional views of relational data, or a special-purpose user interface. The user interface language must have tools for retrieving, formatting, and analyzing data.

8. Data should be either centralized or distributed physically. The DW should have the capability to handle distributed data over a network. This requirement will become more critical as the use of DWs grows and the sources of data expand.

The Life Cycle of Data Warehouses

Entire books have been written about select portions of the data warehouse life cycle. Our purpose in this section is to present some of the basics and give the flavor of data warehousing. We strongly encourage those who wish to pursue data warehousing to continue learning through other books dedicated to data warehousing. Kimball and Ross [1998, 2002] have a series of excellent books covering the details of data warehousing activities.

Figure 8.2 outlines the activities of the data warehouse life cycle, based heavily on Kimball and Ross’s Figure 16.1 [2002]. The life cycle begins with a dialog to determine the project plan and the business requirements. When the plan and the requirements are aligned, design and implementation can proceed. The process forks into three threads that follow independent timelines, meeting up before deployment (see Figure 8.2). Platform issues are covered in one thread, including technical architectural design, followed by product selection and installation. Data issues are covered in a second thread, including dimensional modeling and then physical design, followed by data staging design and development. The special analytical needs of the users are pursued in the third thread, including analytic application specification followed by analytic application development. These three threads join before deployment. Deployment is followed by maintenance and growth, and changes in the requirements must be detected. If adjustments are needed, the cycle repeats. If the system becomes defunct, then the life cycle terminates.

Figure 8.2 Data warehouse life cycle (based heavily on Kimball and Ross [2002], Figure 16.1)
[Figure: a flowchart of the activities just described, from project planning and business requirements definition through deployment, maintenance, and growth.]

The remainder of our data warehouse section focuses on the dimensional modeling activity. More comprehensive material can be found in Kimball and Ross [1998, 2002] and Kimball and Caserta [2004].

8.1.2 Logical Design

We discuss the logical design of data warehouses in this section; the physical design issues are covered in volume two. The logical design of data warehouses is defined by the dimensional data modeling approach. We cover the schema types typically encountered in dimensional modeling, including the star schema and the snowflake schema. We outline the dimensional design process, adhering to the methodology described by Kimball and Ross [2002]. Then we walk through an example, covering some of the crucial concepts of dimensional data modeling.

Dimensional Data Modeling

The dimensional modeling approach is quite different from the normalization approach typically followed when designing a database for daily operations. The context of data warehousing compels a different approach to meeting the needs of the user. The need for dimensional modeling will be discussed further as we proceed. If you haven’t been exposed to data warehousing before, be prepared for some new paradigms.

The Star Schema

Data warehouses are commonly organized with one large central fact table, and many smaller dimension tables. This configuration is termed a star schema; an example is shown in Figure 8.3. The fact table is composed of two types of attributes: dimension attributes and measures. The dimension attributes in Figure 8.3 are CustID, ShipDateID, BindID, and JobID. Most dimension attributes have foreign key/primary key relationships with dimension tables. The dimension tables in Figure 8.3 are Customer, Ship Calendar, and Bind Style. Occasionally, a dimension attribute exists without a related dimension table. Kimball and Ross refer to these as degenerate dimensions. The JobID attribute in Figure 8.3 is a degenerate dimension (more on this shortly). We indicate the dimension attributes that act as foreign keys using the stereotype «fk». The primary keys of the dimension tables are indicated with the stereotype «pk». Any degenerate dimensions in the fact table are indicated with the stereotype «dd». The fact table also contains measures, which contain values to be aggregated when queries group rows together. The measures in Figure 8.3 are Cost and Sell.
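To make the structure concrete, here is a minimal sketch of the Figure 8.3 schema as SQL DDL driven from Python’s sqlite3 module. The table and column names follow the figure; the data types are our own assumptions, since the text does not specify physical types.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ship_calendar (
    ship_date_id     INTEGER PRIMARY KEY,   -- «pk»
    ship_date        TEXT,
    ship_month       TEXT,
    ship_quarter     TEXT,
    ship_year        INTEGER,
    ship_day_of_week TEXT);

CREATE TABLE customer (
    cust_id        INTEGER PRIMARY KEY,     -- «pk»
    name           TEXT,
    cust_type      TEXT,
    city           TEXT,
    state_province TEXT,
    country        TEXT);

CREATE TABLE bind_style (
    bind_id       INTEGER PRIMARY KEY,      -- «pk»
    bind_desc     TEXT,
    bind_category TEXT);

-- Fact table: dimension attributes (three foreign keys plus the
-- degenerate dimension JobID) and the measures Cost and Sell.
CREATE TABLE fact_table (
    cust_id      INTEGER REFERENCES customer(cust_id),            -- «fk»
    ship_date_id INTEGER REFERENCES ship_calendar(ship_date_id),  -- «fk»
    bind_id      INTEGER REFERENCES bind_style(bind_id),          -- «fk»
    job_id       INTEGER,                                         -- «dd»
    cost         REAL,
    sell         REAL);
""")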

Figure 8.3 Example of a star schema for a data warehouse (dimension tables: Ship Calendar, Customer, Bind Style; fact table with «fk» CustID, «fk» ShipDateID, «fk» BindID, «dd» JobID, and measures Cost and Sell)

Queries against the star schema typically use attributes in the dimension tables to select the pertinent rows from the fact table. For example, the user may want to see cost and sell for all jobs where the Ship Month is January 2005. The dimension table attributes are also typically used to group the rows in useful ways when exploring summary information. For example, the user may wish to see the total cost and sell for each Ship Month in the Ship Year 2005. Notice that dimension tables can offer different levels of detail for the user to examine. For example, the Figure 8.3 schema allows the fact table rows to be grouped by Ship Date, Month, Quarter, or Year. These dimension levels form a hierarchy. There is also a second hierarchy in the Ship Calendar dimension that allows the user to group fact table rows by the day of the week. The user can move up or down a hierarchy when exploring the data. Moving down a hierarchy to examine more detailed data is a drill-down operation. Moving up a hierarchy to summarize details is a roll-up operation.
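Continuing the sqlite3 sketch above (an illustration, not code from the book), the selection and the roll-up become ordinary join-and-group queries. This assumes Ship Month holds the month name, with Ship Year held separately, per the Ship Calendar attributes.

# Cost and sell for all jobs shipped in January of 2005.
jobs = conn.execute("""
    SELECT f.job_id, f.cost, f.sell
    FROM fact_table f
    JOIN ship_calendar sc ON f.ship_date_id = sc.ship_date_id
    WHERE sc.ship_month = 'January' AND sc.ship_year = 2005
""").fetchall()

# Roll-up: total cost and sell for each Ship Month in Ship Year 2005.
totals = conn.execute("""
    SELECT sc.ship_month, SUM(f.cost) AS total_cost, SUM(f.sell) AS total_sell
    FROM fact_table f
    JOIN ship_calendar sc ON f.ship_date_id = sc.ship_date_id
    WHERE sc.ship_year = 2005
    GROUP BY sc.ship_month
""").fetchall()

# Drilling down is the same query with a finer grouping,
# e.g. GROUP BY sc.ship_month, sc.ship_date.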

Together, the dimension attributes compose a candidate key of the fact table. The level of detail defined by the dimension attributes is the granularity of the fact table. When designing a fact table, the granularity should be the most detailed level available that any user would wish to examine. This requirement sometimes means that a degenerate dimension, such as JobID in Figure 8.3, must be included. The JobID in this star schema is not used to select or group rows, so there is no related dimension table. The purpose of the JobID attribute is to distinguish rows at the correct level of granularity. Without the JobID attribute, the fact table would group together similar jobs, prohibiting the user from examining the cost and sell values of individual jobs.

Normalization is not the guiding principle in data warehouse design. The purpose of data warehousing is to provide quick answers to queries against a large set of historical data. Star schema organization facilitates quick response to queries in the context of the data warehouse. The core detailed data are centralized in the fact table. Dimensional information and hierarchies are kept in dimension tables, a single join away from the fact table. The hierarchical levels of data contained in the dimension tables of Figure 8.3 violate 3NF, but these violations to the principles of normalization are justified. The normalization process would break each dimension table in Figure 8.3 into multiple tables. The resulting normalized schema would require more join processing for most queries. The dimension tables are small in comparison to the fact table, and typically slow changing. The bulk of operations in the data warehouse are read operations. The benefits of normalization are low when most operations are read only. The benefits of minimizing join operations overwhelm the benefits of normalization in the context of data warehousing. The marked differences between the data warehouse environment and the operational system environment lead to distinct design approaches. Dimensional modeling is the guiding principle in data warehouse design.

Snowflake Schema

The data warehouse literature often refers to a variation of the star schema known as the snowflake schema. Normalizing the dimension tables in a star schema leads to a snowflake schema. Figure 8.4 shows the snowflake schema analogous to the star schema of Figure 8.3. Notice that each hierarchical level becomes its own table. The snowflake schema is generally losing favor. Kimball and Ross strongly prefer the star schema, due to its speed and simplicity. Not only does the star schema yield quicker query response, it is also easier for the user to understand when building queries. We include the snowflake schema here for completeness.

Dimensional Design Process

We adhere to the four-step dimensional design process promoted by Kimball and Ross. Figure 8.5 outlines the activities in the four-step process.

Dimensional Modeling Example

Congratulations, you are now the owner of the ACME Data Mart Company! Your company builds data warehouses. You consult with other companies, design and deploy data warehouses to meet their needs, and support them in their efforts.

Your first customer is XYZ Widget, Inc. XYZ Widget is a manufacturing company with information systems in place. These are operational systems that track the current and recent state of the various business processes. Older records that are no longer needed for operating the plant are purged. This keeps the operational systems running efficiently.

XYZ Widget is now ten years old, and growing fast. The management realizes that information is valuable. The CIO has been saving data before they are purged from the operational system. There are tens of millions of historical records, but there is no easy way to access the data in a meaningful way. ACME Data Mart has been called in to design and build a DSS to access the historical data.

Discussions with XYZ Widget commence. There are many questions they want to have answered by analyzing the historical data. You begin by making a list of what XYZ wants to know.

Figure 8.4 Example of a snowflake schema for a data warehouse (each hierarchical level becomes its own table around the fact table: Ship Date, Ship Month, Ship Quarter, Ship Year, Ship Day of Week; Customer, Cust Type, City, State Province, Country; Bind Style, Bind Category)

Figure 8.5 Four-step dimensional design process [Kimball and Ross, 2002] (Select a Business Process; Determine Granularity; Choose Dimensions; Identify Measures; the cycle repeats while more business processes remain)

XYZ Widget Company Wish List

1. What are the trends of our various products in terms of sales dollars, unit volume, and profit margin?

2. For those products that are not profitable, can we drill down and determine why they are not profitable?

3. How accurately do our estimated costs match our actual costs?

4. When we change our estimating calculations, how are sales and profitability affected?

5. What are the trends in the percentage of jobs that ship on time?

6. What are the trends in productivity by department, for each machine, and for each employee?

7. What are the trends in meeting the scheduled dates for each department, and for each machine?

8. How effective was the upgrade on machine 123?

9. Which customers bring the most profitable jobs?

10. How do our promotional bulk discounts affect sales and profitability?

Looking over the wish list, you begin picking out the business processes involved. The following list is sufficient to satisfy the items on the wish list.

Business Processes

1. Estimating

2. Scheduling

3. Productivity Tracking

4. Job Costing

These four business processes are interlinked in the XYZ Widget Company. Let’s briefly walk through the business processes and the organization of information in the operational systems, so we have an idea what information is available for analysis. For each business process, we’ll design a star schema for storing the data.

The estimating process begins by entering widget specifications. The type of widget determines which machines are used to manufacture the widget. The estimating software then calculates estimated time on each machine used to produce that particular type of widget. Each machine is modeled with a standard setup time and running speed. If a particular type of widget is difficult to process on a particular machine, the times are adjusted accordingly. Each machine has an hourly rate. The estimated time is multiplied by the rate to give labor cost. Each estimate stores widget specifications, a breakdown of the manufacturing costs, the markup and discount applied (if any), and the price. The quote is sent to the customer. If the customer accepts the quote, then the quote is associated with a job number, the specifications are printed as a job ticket, and the job ticket moves to scheduling.

We need to determine the grain before designing a schema for the estimating data mart. The grain should be at the most detailed level, giving the greatest flexibility for drill-down operations when users are exploring the data. The most granular level in the estimating process is the estimating detail. Each estimating detail record specifies information for an individual cost center for a given estimate. This is the finest granularity of estimating data in the operational system, and this level of detail is also potentially valuable for the data warehouse users.

The next design step is to determine the dimensions. Looking at the estimating detail, we see that the associated attributes are the job specifications, the estimate number and date, the job number and win date if the estimate becomes a job, the customer, the promotion, the cost center, the widget quantity, estimated hours, hourly rate, estimated cost, markup, discount, and price. Dimensions are those attributes that the users want to group by when exploring the data. The users are interested in grouping by the various job specifications and by the cost center. The users also need to be able to group by date ranges. The estimate date and the win date are both of interest. Grouping by customer and promotion is also of interest to the users. These become the dimensions of the star schema for the estimating process.

Next, we identify the measures. Measures are the columns that contain values to be aggregated when rows are grouped together. The measures in the estimating process are estimated hours, hourly rate, estimated cost, markup, discount, and price.

The star schema resulting from the analysis of the estimating process is shown in Figure 8.6. There are five widget qualities of interest: shape, color, texture, density, and size. For example, a given widget might be a medium round red fuzzy fluffy widget. The estimate and job numbers are included as degenerate dimensions. The rest of the dimensions and measures are as outlined in the previous two paragraphs.


Dimension values are categorical in nature. For example, a given widget might have a density of fluffy or heavy. The values for the size dimension include small, medium, and large. Measures tend to be numeric, since they are typically aggregated using functions such as sum or average.

The dimension tables should include any hierarchies that may be useful for analysis. For example, widgets are offered in many colors. The colors fall into categories by hue (e.g., pink, blue) and intensity (e.g., pastel, hot). Some even glow in the dark! The user may wish to examine all the pastel widgets as a group, or compare pink versus blue widgets. Including these attributes in the dimension table as shown in Figure 8.7 can accommodate this need.

Figure 8.6 Star schema for estimating process (dimensions: Shape, Color, Texture, Density, Size, Estimate Date, Win Date, Customer, Promotion, Cost Center; degenerate dimensions: estimate number, job number; measures: widget quantity, estimated hours, hourly rate, estimated cost, markup, discount, price)

Figure 8.7 Color dimension showing attributes («pk» color id, color description, hue, intensity, glows in dark)

Dates can also form hierarchies. For example, the user may wish togroup by month, quarter, year or the day of the week. Date dimensionsare very common. The estimating process has two date dimensions: theestimate date and the win date. Typically, the date dimensions haveanalogous attributes. There is an advantage in standardizing the datedimensions across the company. Kimball and Ross [2002] recommendestablishing a single standard date dimension, and then creating viewsof the date dimension for use in multiple dimensions. The use of viewsprovides for standardization, while at the same time allowing theattributes to be named with aliases for intuitive use when multiple datedimensions are present. Figure 8.8 illustrates this concept with a datedimension and two views named Estimate Date and Win Date.
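Continuing the sqlite3 sketch (our own rendering of the Figure 8.8 idea, with our own column names), a single physical date table can play both roles through views that alias its columns:

conn.executescript("""
CREATE TABLE date_dim (
    date_id          INTEGER PRIMARY KEY,   -- «pk»
    date_description TEXT,
    month            TEXT,
    quarter          TEXT,
    year             INTEGER,
    day_of_week      TEXT);

-- Role-playing views: the same rows, renamed for each role.
CREATE VIEW estimate_date AS
    SELECT date_id          AS estimate_date_id,
           date_description AS estimate_date_description,
           month            AS estimate_month,
           quarter          AS estimate_quarter,
           year             AS estimate_year,
           day_of_week      AS estimate_day_of_week
    FROM date_dim;

CREATE VIEW win_date AS
    SELECT date_id          AS win_date_id,
           date_description AS win_date_description,
           month            AS win_month,
           quarter          AS win_quarter,
           year             AS win_year,
           day_of_week      AS win_day_of_week
    FROM date_dim;
""")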

Let’s move on to the scheduling process. Scheduling uses the times calculated by the estimating process to plan the workload on each required machine. Target dates are assigned to each manufacturing step. The job ticket moves into production after the scheduling process completes.

XYZ Widget, Inc. has a shop floor automatic data collection (ADC) system. Each job ticket has a bar code for the assigned job number. Each machine has a sheet with bar codes representing the various operations of that machine. Each employee has a badge with a bar code representing that employee. When an employee starts an operation, the job bar code is scanned, the operation bar code is scanned, and the employee bar code is scanned. The computer pulls in the current system time as the start time. When one operation starts, the previous operation for that employee is automatically stopped (an employee is unable to do more than one operation at once). When the work on the widget job is complete on that machine, the employee marks the job complete via the ADC system. The information gathered through the ADC system is used to update scheduling, track the employee’s work hours and productivity, and also track the machine’s productivity.

Figure 8.8 Date dimensions showing attributes (a single Date dimension with «pk» date id, date description, month, quarter, year, and day of week, plus views Estimate Date and Win Date that rename each attribute for its role)

The design of a star schema for the scheduling process begins by determining the granularity. The most detailed scheduling table in the operational system has a record for each cost center applicable to manufacturing each job. The users in the scheduling department are interested in drilling down to this level of detail in the data warehouse. The proper level of granularity in the star schema for scheduling is determined by the job number and the cost center.

Next we determine the dimensions in the star schema for the scheduling process. The operational scheduling system tracks the scheduled start and finish dates and times, as well as the actual start and finish dates and times. The estimated and actual hours are also stored in the operational scheduling details table, along with a flag indicating whether the operation completed on time. The scheduling team must have the ability to group records by the scheduled and actual start and finish times. Also critical is the ability to group by cost center. The dimensions of the star schema for scheduling are the scheduled and actual start and finish dates and times, and the cost center. The job number must also be included as a degenerate dimension to maintain the proper granularity in the fact table. Figure 8.9 reflects the decisions on the dimensions appropriate for the scheduling process.

The scheduling team is interested in aggregating the estimated hours and also the actual hours. They are also very interested in examining trends in on-time performance. The appropriate measures for the scheduling star schema include the estimated and actual hours and a flag indicating whether the operation was finished on time. The appropriate measures for scheduling are reflected in Figure 8.9.

Figure 8.9 Star schema for the scheduling process (dimensions: Cost Center, Sched Start Date, Sched Start Time, Sched Finish Date, Sched Finish Time, Actual Start Date, Actual Start Time, Actual Finish Date, Actual Finish Time; degenerate dimension: job number; measures: finished on time, estimated hours, actual hours)

There are several standardization principles in play in Figure 8.9. Note that there are multiple time dimensions. These should be standardized with a single time dimension, along with views filling the different roles, similar to the approach used for the date dimensions. Also, notice the Cost Center dimension is present in both the estimating and the scheduling processes. These are actually the same, and should be designed as a single dimension. Dimensions can be shared between multiple star schemas. One last point: the estimated hours are carried from estimating into scheduling in the operational systems. These numbers feed into the star schemas for both the estimating and the scheduling processes. The meaning is the same between the two attributes; therefore, they are both named “estimated hours.” The rule of thumb is that if two attributes carry the same meaning, they should be named the same, and if two attributes are named the same, they carry the same meaning. This consistency allows discussion and comparison of information between business processes across the company.

The next process we examine is productivity tracking. The granularity is determined by the level of detail available in the ADC system. The detail includes the job number, cost center, employee number, and the start and finish date and time. The department managers need to be able to group rows by cost center, employee, and start and finish dates and times. These attributes therefore become the dimensions of the star schema for the productivity process, shown in Figure 8.10. The managers are interested in aggregating productivity numbers, including the widget quantity produced, the percentage finished on time, and the estimated and actual hours. Since these attributes are to be aggregated, they become the measures shown in Figure 8.10.

Figure 8.10 Star schema for the productivity tracking process (dimensions: Cost Center, Employee, Actual Start Date, Actual Start Time, Actual Finish Date, Actual Finish Time; degenerate dimension: job number; measures: widget quantity, finished on time, estimated hours, actual hours)

There are often dimensions in common between star schemas in a data warehouse, because business processes are usually interlinked. A useful tool for tracking the commonality and differences of dimensions across multiple business processes is the data warehouse bus [Kimball and Ross, 2002]. Table 8.2 shows a data warehouse bus for the four business processes in our dimensional design example. Each row represents a business process. Each column represents a dimension. Each X in the body of the table represents the use of the given dimension in the given business process. The data warehouse bus is a handy means of presenting the organization of a data warehouse at a high level. The dimensions common between multiple business processes need to be standardized, or “conformed” in Kimball and Ross’s terminology. A dimension is conformed if there exists a most detailed version of that dimension, and all other uses of that dimension utilize a subset of the attributes and a subset of the rows from that most detailed version. Conforming dimensions ensures that whenever data are related or compared across business processes, the result is meaningful.

Table 8.2 Data Warehouse Bus for Widget Example (each row lists the dimensions marked with an X for that business process)

Estimating: Shape, Color, Texture, Density, Size, Estimate Date, Win Date, Customer, Promotion, Cost Center

Scheduling: Cost Center, Sched Start Date, Sched Start Time, Sched Finish Date, Sched Finish Time, Actual Start Date, Actual Start Time, Actual Finish Date, Actual Finish Time

Productivity Tracking: Cost Center, Employee, Actual Start Date, Actual Start Time, Actual Finish Date, Actual Finish Time

Job Costing: Shape, Color, Texture, Density, Size, Estimate Date, Win Date, Customer, Promotion, Cost Center, Invoice Date

The data warehouse bus also makes some design decisions more obvious. We have taken the liberty of choosing the dimensions for the job-costing process. Table 8.2 includes a row for the job-costing process. When you compare the rows for estimating and job costing, it quickly becomes clear that the two processes have most of the same dimensions. It probably makes sense to combine these two processes into one star schema. This is especially true since job-costing analysis requires comparing estimated and actual values. Figure 8.11 is the result of combining the estimating and job costing processes into one star schema.

Summarizing Data

The star schemas we have covered so far are excellent for capturing the pertinent details. Having fine granularity available in the fact table allows the users to examine data down to that level of granularity. However, the users will often want summaries. For example, the managers may often query for a daily snapshot of the job-costing data. Every query the user may wish to pose against a given star schema can be answered from the detailed fact table. The summary could be aggregated on the fly from the fact table. There is an obvious drawback to this strategy. The fact table contains many millions of rows, due to the detailed nature of the data. Producing a summary on the fly can be expensive in terms of computer resources, resulting in a very slow response. If a summary table were available to answer the queries for the job costing daily snapshot, then the answer could be presented to the user blazingly fast.

Figure 8.11 Star schema for the job costing process (dimensions: Shape, Color, Texture, Density, Size, Estimate Date, Win Date, Customer, Promotion, Cost Center, Invoice Date; degenerate dimensions: estimate number, job number; measures: widget quantity, estimated hours, hourly rate, estimated cost, markup, discount, price, actual hours, actual cost)

The schema for the job costing daily snapshot is shown in Figure 8.12. Notice that most of the dimensions used in the job-costing detail are not used in the snapshot. Summarizing the data has eliminated the need for most dimensions in this context. The daily snapshot contains one row for each day that jobs have been invoiced. The number of rows in the snapshot would be in the thousands. The small size of the snapshot allows very quick response when a user requests the job costing daily snapshot. When there are a small number of summary queries that occur frequently, it is a good strategy to materialize the summary data needed to answer the queries quickly.

The daily snapshot schema in Figure 8.12 also allows the user to group by month, quarter, or year. Materializing summary data is useful for quick response to any query that can be answered by aggregating the data further.

Figure 8.12 Schema for the job costing daily snapshot (Invoice Date dimension; snapshot columns: «fk» invoice date id, widget quantity, estimated hours, estimated cost, price, actual hours, actual cost)
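As a self-contained illustration (the table and column names are ours, following Figures 8.11 and 8.12), the daily snapshot is one aggregation over the detail fact table:

import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
-- Detail fact table (dimension foreign keys omitted for brevity).
CREATE TABLE job_costing_detail (
    invoice_date_id INTEGER,
    widget_quantity INTEGER,
    estimated_hours REAL, estimated_cost REAL, price REAL,
    actual_hours REAL, actual_cost REAL);

-- Materialized summary: one row per invoiced day (Figure 8.12).
CREATE TABLE job_costing_daily_snapshot AS
SELECT invoice_date_id,
       SUM(widget_quantity) AS widget_quantity,
       SUM(estimated_hours) AS estimated_hours,
       SUM(estimated_cost)  AS estimated_cost,
       SUM(price)           AS price,
       SUM(actual_hours)    AS actual_hours,
       SUM(actual_cost)     AS actual_cost
FROM job_costing_detail
GROUP BY invoice_date_id;
""")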

8.2 Online Analytical Processing (OLAP)

Designing and implementing strategic summary tables is a good approach when there is a small set of frequent queries for summary data. However, there may be a need for some users to explore the data in an ad hoc fashion. For example, a user who is looking for types of jobs that have not been profitable needs to be able to roll up and drill down various dimensions of the data. The ad hoc nature of the process makes predicting the queries impossible. Designing a strategic set of summary tables to answer these ad hoc explorations of the data is a daunting task. OLAP provides an alternative. OLAP is a service that overlays the data warehouse. The OLAP system automatically selects a strategic set of summary views, and saves the automatic summary tables (ASTs) to disk as materialized views. The OLAP system also maintains these views, keeping them in step with the fact tables as new data arrives. When a user requests summary data, the OLAP system figures out which AST can be used for a quick response to the given query. OLAP systems are a good solution when there is a need for ad hoc exploration of summary information based on large amounts of data residing in a data warehouse.

OLAP systems automatically select, maintain, and use the ASTs. Thus, an OLAP system effectively does some of the design work automatically. This section covers some of the issues that arise in building an OLAP engine, and some of the possible solutions. If you use an OLAP system, the vendor delivers the OLAP engine to you. The issues and solutions discussed here are not items that you need to resolve. Our goal here is to remove some of the mystery about what an OLAP system is and how it works.

8.2.1 The Exponential Explosion of Views

Materialized views aggregated from a fact table can be uniquely identified by the aggregation level for each dimension. Given a hierarchy along a dimension, let 0 represent no aggregation, 1 represent the first level of aggregation, and so on. For example, if the Invoice Date dimension has a hierarchy consisting of date id, month, quarter, year, and “all” (i.e., complete aggregation), then date id is level 0, month is level 1, quarter is level 2, year is level 3, and “all” is level 4. If a dimension does not explicitly have a hierarchy, then level 0 is no aggregation, and level 1 is “all.” The scales so defined along each dimension define a coordinate system for uniquely identifying each view in a product graph. Figure 8.13 illustrates a product graph in two dimensions. Product graphs are a generalization of the hypercube lattice structure introduced by Harinarayan, Rajaraman, and Ullman [1996], where dimensions may have associated hierarchies. The top node, labeled (0, 0) in Figure 8.13, represents the fact table. Each node represents a view with aggregation levels as indicated by the coordinate. The relationships descending the product graph indicate aggregation relationships. The five shaded nodes indicate that these views have been materialized. A view can be aggregated from any materialized ancestor view. For example, if a user issues a query for rows grouped by year and state, that query would naturally be answered by the view labeled (3, 2). View (3, 2) is not materialized, but the query can be answered from the materialized view (2, 1), since (2, 1) is an ancestor of (3, 2). Quarters can be aggregated into years, and cities can be aggregated into states.
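The coordinate system is easy to model in code. This sketch (ours, not from the OLAP literature) enumerates the 5 × 4 = 20 views of Figure 8.13 and implements the ancestor test that lets (2, 1) answer a query destined for (3, 2):

from itertools import product

# Figure 8.13 hierarchies: Calendar levels 0-4 (date id, month, quarter,
# year, all) and Customer levels 0-3 (cust id, city, state, all).
views = list(product(range(5), range(4)))
assert len(views) == 20

# The five materialized (shaded) views, per Figure 8.13 and Table 8.6.
materialized = [(0, 0), (1, 0), (2, 1), (0, 2), (0, 3)]

def is_ancestor(vi, vj):
    """vi is an ancestor of vj if it is at least as detailed (a lower or
    equal aggregation level) in every dimension."""
    return vi != vj and all(a <= b for a, b in zip(vi, vj))

def usable_sources(query_view):
    return [m for m in materialized
            if m == query_view or is_ancestor(m, query_view)]

print(usable_sources((3, 2)))   # includes (2, 1): quarters roll up to
                                # years, and cities roll up to states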


The central issue challenging the design of OLAP systems is the exponential explosion of possible views as the number of dimensions increases. The Calendar dimension in Figure 8.13 has five levels of hierarchy, and the Customer dimension has four levels of hierarchy. The user may choose any level of aggregation along each dimension. The number of possible views is the product of the number of hierarchical levels along each dimension. The number of possible views for the example in Figure 8.13 is 5 × 4 = 20. Let d be the number of dimensions in a data warehouse. Let h_i be the number of hierarchical levels in dimension i. The general equation for calculating the number of possible views is given by Equation 8.1.

Possible views = ∏(i = 1 to d) h_i        8.1

Figure 8.13 Product graph labeled with aggregation level coordinates (Calendar dimension, first coordinate: 0 date id, 1 month, 2 quarter, 3 year, 4 all; Customer dimension, second coordinate: 0 cust id, 1 city, 2 state, 3 all; the top node (0, 0) is the fact table, and five shaded nodes are materialized)

If we express Equation 8.1 in different terms, the problem of exponential explosion becomes more apparent. Let g be the geometric mean of the number of hierarchical levels in the dimensions. Then Equation 8.1 becomes Equation 8.2.

Possible views = g^d        8.2

As dimensionality increases linearly, the number of possible views explodes exponentially. If g = 5 and d = 5, there are 5^5 = 3,125 possible views. Thus if d = 10, then there are 5^10 = 9,765,625 possible views. OLAP administrators need the freedom to scale up the dimensionality of their data warehouses. Clearly the OLAP system cannot create and maintain all possible views as dimensionality increases. The design of OLAP systems must deliver quick response while maintaining a system within the resource limitations. Typically, a strategic subset of views must be selected for materialization.

8.2.2 Overview of OLAP

There are many approaches to implementing OLAP systems presented in the literature. Figure 8.14 maps out one possible approach, which will serve for discussion. The larger problem of OLAP optimization is broken into four subproblems: view size estimation, materialized view selection, materialized view maintenance, and query optimization with materialized views. This division is generally true of the OLAP literature, and is reflected in the OLAP system plan shown in Figure 8.14.

We describe how the OLAP processes interact in Figure 8.14, and then explore each process in greater detail. The plan for OLAP optimization shows Sample Data moving from the Fact Table into View Size Estimation. View Selection makes an Estimate Request for the view size of each view it considers for materialization. View Size Estimation queries the Sample Data, examines it, and models the distribution. The distribution observed in the sample is used to estimate the expected number of rows in the view for the full dataset. The Estimated View Size is passed to View Selection, which uses the estimates to evaluate the relative benefits of materializing the various views under consideration. View Selection picks Strategically Selected Views for materialization with the goal of minimizing total query costs. View Maintenance builds the original views from the Initial Data from the Fact Table, and maintains the views as Incremental Data arrives from Updates. View Maintenance sends statistics on View Costs back to View Selection, allowing costly views to be discarded dynamically. View Maintenance offers Current Views for use by Query Optimization. Query Optimization must consider which of the Current Views can be utilized to most efficiently answer Queries from Users, giving Quick Responses to the Users. View Usage feeds back into View Selection, allowing the system to dynamically adapt to changes in query workloads.

8.2.3 View Size Estimation

OLAP systems selectively materialize strategic views with high benefits to achieve quick response to queries, while remaining within the resource limits of the computer system. The size of a view affects how much disk space is required to store the view. More importantly, the size of the view determines in part how much disk input/output will be consumed when querying and maintaining the view. Calculating the exact size of a given view requires calculating the view from the base data. Reading the base data and calculating the view is the majority of the work necessary to materialize the view. Since the objective of view materialization is to conserve resources, it becomes necessary to estimate the size of the views under consideration for materialization.

Figure 8.14 A plan for OLAP optimization (processes: View Size Estimation, View Selection, View Maintenance, and Query Optimization, connected by the flows Sample Data, Estimate Request, Estimated View Size, Strategically Selected Views, Initial Data, Incremental Data, View Costs, Current Views, View Usage, Queries, and Quick Responses)

Cardenas’ formula [Cardenas, 1975] is a simple equation (Equation 8.3) that is applicable to estimating the number of rows in a view:

Let n be the number of rows in the fact table.

Let v be the number of possible keys in the data space of the view.

Expected distinct values = v(1 – (1 – 1/v)^n)        8.3
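Equation 8.3 is a one-liner in code. A small sketch (ours) shows that with a million fact rows and 10,000 possible keys, nearly every key is expected to appear in the view:

def cardenas(n: int, v: int) -> float:
    """Equation 8.3: expected distinct values among n rows drawn
    uniformly over v possible keys."""
    return v * (1.0 - (1.0 - 1.0 / v) ** n)

print(cardenas(n=1_000_000, v=10_000))   # ~10,000: the view is nearly full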

Cardenas’ formula assumes a uniform data distribution. However, many data distributions exist. The data distribution in the fact table affects the number of rows in a view. Cardenas’ formula is very quick, but the assumption of a uniform data distribution leads to gross overestimates of the view size when the data is actually clustered. Other methods have been developed to model the effect of data distribution on the number of rows in a view.

Faloutsos, Matias, and Silberschatz [1996] present a sampling approach based on the binomial multifractal distribution. Parameters of the distribution are estimated from a sample. The number of rows in the aggregated view for the full data set is then estimated using the parameter values determined from the sample. Equations 8.4 and 8.5 [Faloutsos, Matias, and Silberschatz, 1996] are presented for this purpose.

Expected distinct values = Σ(a = 0 to k) C(k, a) ⋅ (1 – (1 – P_a)^n)        8.4

P_a = P^(k–a)(1 – P)^a        8.5

Figure 8.15 illustrates an example. Order k is the decision tree depth. C(k, a) is the number of bins in the set reachable by taking some combination of a left-hand edges and k – a right-hand edges in the decision tree. P_a is the probability of reaching a given bin whose path contains a left-hand edges. n is the number of rows in the data set. Bias P is the probability of selecting the right-hand edge at a choice point in the tree.

The calculations of Equation 8.4 are illustrated with a small example. An actual database would yield much larger numbers, but the concepts and the equations are the same. These calculations can be done with logarithms, resulting in very good scalability. Based on Figure 8.15, given five rows, calculate the expected distinct values using Equation 8.4:

Expected distinct values =
1 ⋅ (1 – (1 – 0.729)^5) + 3 ⋅ (1 – (1 – 0.081)^5) +
3 ⋅ (1 – (1 – 0.009)^5) + 1 ⋅ (1 – (1 – 0.001)^5) ≈ 1.965        8.6
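A direct implementation of Equations 8.4 and 8.5 (our own sketch, with k, P, and n as defined above) is straightforward:

import math

def expected_distinct(k: int, P: float, n: int) -> float:
    """Equations 8.4 and 8.5: expected distinct values among n rows
    under a binomial multifractal distribution of order k and bias P."""
    total = 0.0
    for a in range(k + 1):
        p_a = P ** (k - a) * (1 - P) ** a             # Equation 8.5
        total += math.comb(k, a) * (1 - (1 - p_a) ** n)
    return total

# With the Figure 8.15 tree (k = 3, P = 0.9), a large row count touches
# nearly all 2**k = 8 bins:
print(expected_distinct(k=3, P=0.9, n=1000))   # roughly 7.6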


The values of P and k can be estimated based on sample data. The algorithm used in [Faloutsos, Matias, and Silberschatz, 1996] has three inputs: the number of rows in the sample, the frequency of the most commonly occurring value, and the number of distinct aggregate rows in the sample. The value of P is calculated based on the frequency of the most commonly occurring value. They begin with:

k = ⌈log2(Distinct rows in sample)⌉        8.7

and then adjust k upwards, recalculating P until a good fit to the number of distinct rows in the sample is found.
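The book does not spell out the fitting loop, so the following is only a hedged sketch of one plausible reading, reusing expected_distinct from above; the formula for P is our assumption that the most common value occupies the highest-probability bin, whose probability is P^k:

def fit_multifractal(sample_rows: int, max_frequency: int, distinct_rows: int):
    """Estimate order k and bias P from a sample (a speculative sketch).
    Start with k from Equation 8.7, then raise k, recalculating P each
    time, until the model reproduces the observed distinct-row count."""
    k = max(1, math.ceil(math.log2(distinct_rows)))
    P = (max_frequency / sample_rows) ** (1.0 / k)   # assume P**k = share of top value
    while expected_distinct(k, P, sample_rows) < distinct_rows and k < 64:
        k += 1
        P = (max_frequency / sample_rows) ** (1.0 / k)
    return k, P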

Other distribution models can be utilized to predict the size of a view based on sample data. For example, the use of the Pareto distribution model has been explored [Nadeau and Teorey, 2003]. Another possibility is to find the best fit to the sample data for multiple distribution models, calculate which model is most likely to produce the given sample data, and then use that model to predict the number of rows for the full data set. This would require calculation for each distribution model considered, but should generally result in more accurate estimates.

Figure 8.15 Example of a binomial multifractal distribution tree (depth k = 3, with bias P = 0.9 on each right-hand edge and 0.1 on each left-hand edge; the leaf bins have probabilities P_0 = 0.729, P_1 = 0.081, P_2 = 0.009, and P_3 = 0.001, reached by C(3, 0) = 1, C(3, 1) = 3, C(3, 2) = 3, and C(3, 3) = 1 bins respectively)

8.2.4 Selection of Materialized Views

Most of the published works on the problem of materialized view selection are based on the hypercube lattice structure [Harinarayan, Rajaraman, and Ullman, 1996]. The hypercube lattice structure is a special case of the product graph structure, where the number of hierarchical levels for each dimension is two. Each dimension can either be included or excluded from a given view. Thus, the nodes in a hypercube lattice structure represent the power set of the dimensions.

Figure 8.16 illustrates the hypercube lattice structure with an example [Harinarayan, Rajaraman, and Ullman, 1996]. Each node of the lattice structure represents a possible view. Each node is labeled with the set of dimensions in the “group by” list for that view. The numbers associated with the nodes represent the number of rows in the view. These numbers are normally derived from a view size estimation algorithm, as discussed in Section 8.2.3. However, the numbers in Figure 8.16 follow the example as given by Harinarayan et al. [1996]. The relationships between nodes indicate which views can be aggregated from other views. A given view can be calculated from any materialized ancestor view.

We refer to the algorithm for selecting materialized views introduced by Harinarayan et al. [1996] as HRU. The initial state for HRU has only the fact table materialized. HRU calculates the benefit of each possible view during each iteration, and selects the most beneficial view for materialization. Processing continues until a predetermined number of materialized views is reached.

Figure 8.16 Example of a hypercube lattice structure [Harinarayan et al. 1996] (c = Customer, p = Part, s = Supplier; node sizes: {c, p, s} 6M rows, the fact table; {p, s} 0.8M; {c, s} 6M; {c, p} 6M; {s} 0.01M; {p} 0.2M; {c} 0.1M; {} 1)

Table 8.3 shows the calculations for the first two iterations of HRU. Materializing {p, s} saves 6M – 0.8M = 5.2M rows for each of four views: {p, s} and its three descendants: {p}, {s}, and {}. The view {c, s} yields no benefit materialized, since any query that can be answered by reading 6M rows from {c, s} can also be answered by reading 6M rows from the fact table {c, p, s}. HRU calculates the benefits of each possible view materialization. The view {p, s} is selected for materialization in the first iteration. The view {c} is selected in the second iteration.

HRU is a greedy algorithm that does not guarantee an optimal solution, although testing has shown that it usually produces a good solution. Further research has built upon HRU, accounting for the presence of index structures, update costs, and query frequencies.

HRU evaluates every unselected node during each iteration, and each evaluation considers the effect on every descendant. The algorithm consumes O(kn^2) time, where k = |views to select| and n = |nodes|. This order of complexity looks very good; it is polynomial time. However, the result is misleading. The nodes of the hypercube lattice structure constitute a power set. The number of possible views is therefore 2^d, where d = |dimensions|. Thus, n = 2^d, and the time complexity of HRU is O(k ⋅ 2^(2d)). HRU runs in time exponential in the number of dimensions of the database.

The Polynomial Greedy Algorithm (PGA) [Nadeau and Teorey, 2002] offers a more scalable alternative to HRU. PGA, like HRU, selects one view for materialization with each iteration. However, PGA divides each iteration into a nomination phase and a selection phase. The first phase nominates promising views into a candidate set. The second phase estimates the benefits of materializing each candidate, and selects the view with the highest evaluation for materialization.

Table 8.3 Two Iterations of HRU, Based on Figure 8.16

View      Iteration 1 Benefit      Iteration 2 Benefit
{p, s}    5.2M × 4 = 20.8M         (materialized)
{c, s}    0 × 4 = 0                0 × 2 = 0
{c, p}    0 × 4 = 0                0 × 2 = 0
{s}       5.99M × 2 = 11.98M       0.79M × 2 = 1.58M
{p}       5.8M × 2 = 11.6M         0.6M × 2 = 1.2M
{c}       5.9M × 2 = 11.8M         5.9M × 2 = 11.8M
{}        6M – 1                   0.8M – 1
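The greedy loop is compact enough to sketch in full. The following is our own Python rendering of HRU as described, with the view sizes hard-coded from Figure 8.16; running it selects {p, s} and then {c}, the same choices recorded in Table 8.3. (PGA differs by nominating only a single path of candidates before each selection.)

# View sizes (rows) from Figure 8.16, keyed by frozenset of dimensions.
SIZES = {
    frozenset("cps"): 6_000_000,
    frozenset("ps"): 800_000, frozenset("cs"): 6_000_000,
    frozenset("cp"): 6_000_000,
    frozenset("s"): 10_000, frozenset("p"): 200_000,
    frozenset("c"): 100_000, frozenset(): 1,
}

def cheapest_source(view, materialized):
    """Rows read to answer `view` today: its smallest materialized ancestor."""
    return min(SIZES[m] for m in materialized if m >= view)

def benefit(v, materialized):
    """Rows saved, summed over v and its descendants, if v is materialized."""
    return sum(max(cheapest_source(w, materialized) - SIZES[v], 0)
               for w in SIZES if w <= v)

def hru(views_to_select):
    materialized = {frozenset("cps")}     # initial state: only the fact table
    for _ in range(views_to_select):
        best = max((v for v in SIZES if v not in materialized),
                   key=lambda v: benefit(v, materialized))
        materialized.add(best)
    return materialized

print(hru(2))   # contains {p, s} and {c}, plus the fact table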


The nomination phase begins at the top of the lattice; in Figure 8.16, this is the node {c, p, s}. PGA nominates the smallest node from amongst the children. The candidate set is now {{p, s}}. PGA then examines the children of {p, s} and nominates the smallest child, {s}. The process repeats until the bottom of the lattice is reached. The candidate set is then {{p, s}, {s}, {}}. Once a path of candidate views has been nominated, the algorithm enters the selection phase. The resulting calculations are shown in Tables 8.4 and 8.5.

Compare Tables 8.4 and 8.5 with Table 8.3. Notice PGA does fewer calculations than HRU, and yet in this example reaches the same decisions as HRU. PGA usually picks a set of views nearly as beneficial as those chosen by HRU, and yet PGA is able to function when HRU fails due to the exponential complexity. PGA is polynomial relative to the number of dimensions. When HRU fails, PGA extends the usefulness of the OLAP system.

Table 8.4 First Iteration of PGA, Based on Figure 8.16

Candidates    Iteration 1 Benefit
{p, s}        5.2M × 4 = 20.8M
{s}           5.99M × 2 = 11.98M
{}            6M – 1

Table 8.5 Second Iteration of PGA, Based on Figure 8.16

Candidates    Iteration 2 Benefit
{c, s}        0 × 2 = 0
{s}           0.79M × 2 = 1.58M
{c}           5.9M × 2 = 11.8M
{}            6M – 1

The materialized view selection algorithms discussed so far are static; that is, the views are picked once and then materialized. An entirely different approach to the selection of materialized views is to treat the problem as one of memory management [Kotidis and Roussopoulos, 1999]. The materialized views constitute a view pool. Metadata is tracked on usage of the views. The system monitors both space and update window constraints. The contents of the view pool are adjusted dynamically. As queries are posed, views are added appropriately. Whenever a constraint is violated, the system selects a view for eviction. Thus the view pool can improve as more usage statistics are gathered. This is a self-tuning system that adjusts to changing query patterns.

The static and dynamic approaches complement each other and should be integrated. Static approaches run fast from the beginning, but do not adapt. Dynamic view selection begins with an empty view pool, and therefore yields slow response times when a data warehouse is first loaded; however, it is adaptable and improves over time. The complementary nature of these two approaches has influenced our design plan in Figure 8.14, as indicated by Queries feeding back into View Selection.

8.2.5 View Maintenance

Once a view is selected for materialization, it must be computed and stored. When the base data is updated, the aggregated view must also be updated to maintain consistency between views. The original view materialization and the incremental updates are both considered as view maintenance in Figure 8.14. The efficiency of view maintenance is greatly affected by the data structures implementing the view. OLAP systems are multidimensional, and fact tables contain large numbers of rows. The access methods implementing the OLAP system must meet the challenges of high dimensionality in combination with large row counts. The physical structures used are deferred to volume two, which covers physical design.

Most of the research papers in the area of view maintenance assume that new data is periodically loaded with incremental data during designated update windows. Typically, the OLAP system is made unavailable to the users while the incremental data is loaded in bulk, taking advantage of the efficiencies of bulk operations. There is a down side to deferring the loading of incremental data until the next update window. If the data warehouse receives incremental data once a day, then there is a one-day latency period.

There is currently a push in the industry to accommodate data updates close to real time, keeping the data warehouse in step with the operational systems. This is sometimes referred to as “active warehousing” and “real-time analytics.” The need for data latency of only a few minutes presents new problems. How can very large data structures be maintained efficiently with a trickle feed? One solution is to have a second set of data structures with the same schema as the data warehouse. This second set of data structures acts as a holding tank for incremental data, and is referred to as a delta cube in OLAP terminology. The operational systems feed into the delta cube, which is small and efficient for quick incremental changes. The data cube is updated periodically from the delta cube, taking advantage of bulk operation efficiencies. When the user queries the OLAP system, the query can be issued against both the data cube and the delta cube to obtain an up-to-date result. The delta cube is hidden from the user. What the user sees is an OLAP system that is nearly current with the operational systems.

8.2.6 Query Optimization

When a query is posed to an OLAP system, there may be multiple materialized views available that could be used to compute the result. For example, if we have the situation represented in Figure 8.13, and a user issues a query to group rows by month and state, that query is naturally answered from the view labeled (1, 2). However, since (1, 2) is not materialized, we need to find a materialized ancestor to obtain the data. There are three such nodes in the product graph of Figure 8.13. The query can be answered from nodes (0, 0), (1, 0), or (0, 2). With the possibility of answering queries from alternative sources, the optimization issue arises as to which source is the most efficient for the given query. Most existing research focuses on syntactic approaches. The possible query translations are carried out, alternative query costs are estimated, and what appears to be the best plan is executed. Another approach is to query a metadata table containing information on the materialized views to determine the best view to query against, and then translate the original SQL query to use the best view.

Database systems contain metadata tables that hold data about the tables and other structures used by the system. The metadata tables facilitate the system in its operations. Here’s an example where a metadata table can facilitate the process of finding the best view to answer a query in an OLAP system. The coordinate system defined by the aggregation levels forms the basis for organizing the metadata for tracking the materialized views. Table 8.6 displays the metadata for the materialized views shaded in Figure 8.13. The two dimensions labeled Calendar and Customer form the composite key. The Blocks column tracks the actual number of blocks in each materialized view. The ViewID column is used to identify the associated materialized view. The implementation stores materialized views as tables where the value of the ViewID forms part of the table name. For example, the row with ViewID = 3 contains information on the aggregated view that is materialized as table AST3 (short for automatic summary table 3).

Table 8.6 Example of Materialized View Metadata

Calendar    Customer    Blocks        ViewID
0           0           10,000,000    1
0           2           50,000        3
0           3           1,000         5
1           0           300,000       2
2           1           10,000        4

Observe the general pattern in the coordinates of the views in the product graph with regard to ancestor relationships. Let Value(V, d) represent a function that returns the aggregation level for view V along dimension d. For any two views Vi and Vj where Vi ≠ Vj, Vi is an ancestor of Vj if and only if for every dimension d of the composite key, Value(Vi, d) ≤ Value(Vj, d). This pattern in the keys can be utilized to identify ancestors of a given view by querying the metadata. The semantics of the product graph are captured by the metadata, permitting the OLAP system to search semantically for the best materialized ancestor view by querying the metadata table. After the best materialized view is determined, the OLAP system can rewrite the original query to utilize the best materialized view, and proceed.
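The ancestor test translates directly into a lookup over the metadata. A sketch (ours), using the Table 8.6 rows, picks the smallest qualifying view for a requested aggregation level:

# Table 8.6 rows: (calendar_level, customer_level, blocks, view_id).
METADATA = [
    (0, 0, 10_000_000, 1),
    (0, 2, 50_000, 3),
    (0, 3, 1_000, 5),
    (1, 0, 300_000, 2),
    (2, 1, 10_000, 4),
]

def best_view(calendar_level, customer_level):
    """Smallest materialized view that is an ancestor of the request:
    at least as detailed (<=) in every dimension of the composite key."""
    candidates = [(blocks, view_id)
                  for cal, cust, blocks, view_id in METADATA
                  if cal <= calendar_level and cust <= customer_level]
    return min(candidates)[1]

# Group by month (level 1) and state (level 2): views 1, 2, and 3 qualify;
# view 3 (50,000 blocks) is cheapest, so the query is rewritten against AST3.
print(best_view(1, 2))   # 3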

8.3 Data Mining

Two general approaches are used to extract knowledge from a database. First, a user may have a hypothesis to verify or disprove. This type of analysis is done with standard database queries and statistical analysis. The second approach to extracting knowledge is to have the computer search for correlations in the data, and present promising hypotheses to the user for consideration. The methods included here are data mining techniques developed in the fields of Machine Learning and Knowledge Discovery.

Data mining algorithms attempt to solve a number of common problems. One general problem is categorization: given a set of cases with known values for some parameters, classify the cases. For example, given observations of patients, suggest a diagnosis. Another general problem type is clustering: given a set of cases, find natural groupings of the cases. Clustering is useful, for example, in identifying market segments. Association rules, also known as market basket analyses, are another common problem. Businesses sometimes want to know what items are frequently purchased together. This knowledge is useful, for example, when decisions are made about how to lay out a grocery store. There are many types of data mining available. Han and Kamber [2001] cover data mining in the context of data warehouses and OLAP systems. Mitchell [1997] is a rich resource, written from the machine learning perspective. Witten and Frank [2000] give a survey of data mining, along with freeware written in Java available from the Weka Web site [http://www.cs.waikato.ac.nz/ml/weka]. The Weka Web site is a good option for those who wish to experiment with and modify existing algorithms. The major database vendors also offer data mining packages that function with their databases.

Due to the large scope of data mining, we focus on two forms of data mining: forecasting and text mining.

8.3.1 Forecasting

Forecasting is a form of data mining in which trends are modeled over time using known data, and future trends are predicted based on the model. There are many different prediction models with varying levels of sophistication. Perhaps the simplest model is the least squares line model. The best fit line is calculated from known data points using the method of least squares. The line is projected into the future to determine predictions. Figure 8.17 shows a least squares line for an actual data set. The crossed (jagged) points represent actual known data. The circular (dots) points represent the least squares line. When the least squares line projects beyond the known points, this region represents predictions. The intervals associated with the predictions in our figures represent a 90% prediction interval. That is, given an interval, there is a 90% probability that the actual value, when known, will lie in that interval.

The least squares line approach weights each known data point equally when building the model. The predicted upward trend in Figure 8.17 does not give any special consideration to the recent downturn.
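For illustration, a least squares trend line takes only a few lines of numpy (our own sketch with made-up numbers, since the book’s data set is not given; real prediction intervals would also require the residual variance):

import numpy as np

# Hypothetical monthly observations.
y = np.array([12.0, 13.1, 12.8, 14.0, 14.6, 15.2, 15.9, 15.1])
t = np.arange(len(y))

slope, intercept = np.polyfit(t, y, deg=1)   # least squares fit of a line

# Project the line three periods beyond the known points: the predictions.
future = np.arange(len(y), len(y) + 3)
print(intercept + slope * future)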

Exponential smoothing is an approach that weights recent history more heavily than distant history. Double exponential smoothing models two components: level and trend (hence “double” exponential smoothing). As the known values change in level and trend, the model adapts. Figure 8.18 shows the predictions made using double exponential smoothing, based on the same data set used to compute Figure 8.17. Notice the prediction is now more tightly bound to recent history.
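A hedged sketch of double exponential smoothing in one standard formulation (Holt’s linear method; the smoothing constants alpha and beta are our own choices, and the book does not commit to a particular variant):

def double_exponential_smoothing(y, alpha=0.5, beta=0.3, horizon=3):
    """Holt's linear method: maintain a level and a trend component,
    updating both as each observation arrives, then project forward."""
    level, trend = y[0], y[1] - y[0]
    for value in y[1:]:
        last_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
    return [level + (h + 1) * trend for h in range(horizon)]

print(double_exponential_smoothing([12.0, 13.1, 12.8, 14.0, 14.6, 15.2, 15.9, 15.1]))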

Triple exponential smoothing models three components: level, trend, and seasonality. This is more sophisticated than double exponential smoothing, and gives better predictions when the data does indeed exhibit seasonal behavior. Figure 8.19 shows the predictions made by triple exponential smoothing, based on the same data used to compute Figures 8.17 and 8.18. Notice the prediction intervals are tighter than in Figures 8.17 and 8.18. This is a sign that the data varies seasonally; triple exponential smoothing is a good model for the given type of data.

Exactly how reliable are these predictions? If we revisit the predictions after time has passed and compare the predictions with the actual values, are they accurate? Figure 8.20 shows the actual data overlaid with the predictions made in Figure 8.19. Most of the actual data points do indeed lie within the prediction intervals. The prediction intervals look very reasonable. Why don’t we use these forecast models to make our millions on Wall Street? Take a look at Figure 8.21, a cautionary tale. Figure 8.21 is also based on the triple exponential smoothing model, using four years of known data for training, compared with five years of data used in constructing the model for Figure 8.20. The resulting predictions match for four months, and then diverge greatly from reality. The problem is that forecast models are built on known data, with the assumption that known data forms a good basis for predicting the future. This may be true most of the time; however, forecast models can be unreliable when the market is changing or about to change drastically. Forecasting can be a useful tool, but the predictions must be taken only as indicators.

Figure 8.17 Least squares line (courtesy of Ubiquiti, Inc.)

The details of the forecast models discussed here, as well as many others, can be found in Makridakis et al. [1998].

8.3.2 Text Mining

Most of the work on data processing over the past few decades has used structured data. The vast majority of systems in use today read and store data in relational databases. The schemas are organized neatly in rows and columns. However, there are large amounts of data that reside in freeform text. Descriptions of warranty claims are written in text. Medical records are written in text. Text is everywhere. Only recently has the work in text analysis made significant headway. Companies are now marketing products that focus on text analysis.

Figure 8.18 Double exponential smoothing (courtesy of Ubiquiti, Inc.)


Figure 8.19 Triple exponential smoothing (courtesy of Ubiquiti, Inc.)

Figure 8.20 Triple exponential smoothing with actual values overlaying forecast values, based on five years of training data (courtesy of Ubiquiti, Inc.)


Let’s look at a few of the possibilities for analyzing text and their potential impact. We’ll take the area of automotive warranty claims as an example. When something goes wrong with your car, you bring it to an automotive shop for repairs. You describe to a shop representative what you’ve observed going wrong with your car. Your description is typed into a computer. A mechanic works on your car, and then types in observations about your car and the actions taken to remedy the problem. This is valuable information for the automotive companies and the parts manufacturers. If the information can be analyzed, they can catch problems early and build better cars. They can reduce breakdowns, saving themselves money and saving their customers frustration.

The data typed into the computer is often entered in a hurry. The language includes abbreviations, jargon, misspelled words, and incorrect grammar. Figure 8.22 shows an example entry from an actual warranty claim database.

As you can see, the raw information entered on the shop floor is barely English. Figure 8.23 shows a cleaned up version of the same text.

Figure 8.21 Triple exponential smoothing with actual values overlaying forecast values, based on four years of training data (courtesy of Ubiquiti, Inc.)


Even the cleaned up version is difficult to read. The companies paying out warranty claims want each claim categorized in various ways, to track what problems are occurring. One option is to hire many people to read the claims and determine how each claim should be categorized. Categorizing the claims manually is tedious work. A more viable option, developed in the last few years, is to apply a software solution. Figure 8.24 shows some of the information that can be gleaned automatically from the text in Figure 8.22.

Figure 8.22 Example of a verbatim description in a warranty claim (courtesy of Ubiquiti, Inc.):

7 DD40 BASC 54566 CK OUT AC INOP PREFORM PID CK CK PCM PID ACC CK OK OPERATING ON AND OFF PREFORM POWER AND GRONED CK AT COMPRESOR FONED NO GRONED PREFORM PINPONT DIAG AND TRACE GRONED FONED BAD CO NECTION AT S778 REPAIR AND RETEST OK CK AC OPERATION

Figure 8.23 Cleaned up version of description in warranty claim (courtesy of Ubiquiti, Inc.):

7 DD40 Basic 54566 Check Out Air Conditioning Inoperable Perform PID Check Check Power Control Module PID Accessory Check OK Operating On And Off Perform Power And Ground Check At Compressor Found No Ground Perform Pinpoint Diagnosis And Trace Ground Found Bad Connection At Splice 778 Repair And Retest OK Check Air Conditioning Operation.

Figure 8.24 Useful information extracted from verbatim description in warranty claim (courtesy of Ubiquiti, Inc.):

Automated Coding             Confidence
Primary Group: Electrical    90%
Subgroup: Climate Control    85%
Part: Connector 1008         93%
Problem: Bad Connection      72%
Repair: Reconnect            75%
Location: Engin. Cmprt.      90%

The software processes the text and determines the concepts likely represented in the text. This is not a simple word search. Synonyms map to the same concept. Some words map to different concepts depending on the context. The software uses an ontology that relates words and concepts to each other. After each warranty claim is categorized in various ways, it becomes possible to obtain useful aggregate information, as shown in Figure 8.25.
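Once claims carry codes like those in Figure 8.24, the aggregate report is a routine SQL query. As a hypothetical sketch (the table coded_claim and its columns are illustrative names, not from an actual product):

-- count categorized claims by primary group and vehicle type
select primary_group, vehicle_type, count(*) as claim_count
from coded_claim
group by primary_group, vehicle_type;

Grouping by the automatically assigned categories is what turns thousands of free-text descriptions into the kind of chart shown in Figure 8.25.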

8.4 Summary

Data warehousing, OLAP, and data mining are three areas of computer science that are tightly interlinked and marketed under the heading of business intelligence. The functionalities of these three areas complement each other. Data warehousing provides an infrastructure for storing and accessing large amounts of data in an efficient and user-friendly manner. Dimensional data modeling is the approach best suited for designing data warehouses. OLAP is a service that overlays the data warehouse. The purpose of OLAP is to provide quick response to ad hoc queries, typically involving grouping rows and aggregating values. Roll-up and drill-down operations are typical. OLAP systems automatically perform some design tasks, such as selecting which views to materialize in order to provide quick response times. OLAP is a good tool for exploring the data in a human-driven fashion, when the person has a clear question in mind. Data mining is usually computer driven, involving analysis of the data to create likely hypotheses that might be of interest to users. Data mining can bring to the forefront valuable and interesting structure in the data that would otherwise have gone unnoticed.

Figure 8.25 Aggregate data from warranty claims (courtesy of Ubiquiti, Inc.)

[Bar chart: number of claims, on a scale of 0 to 100, for the primary groups Electrical, Seating, Exterior, and Engine, broken out by vehicle type (Cars, Trucks, Other).]


8.5 Literature Summary

The evolution and principles of data warehouses can be found in Barquin and Edelstein [1997], Cataldo [1997], Chaudhuri and Dayal [1997], Gray and Watson [1998], Kimball and Ross [1998, 2002], and Kimball and Caserta [2004]. OLAP is discussed in Barquin and Edelstein [1997], Faloutsos, Matia, and Silberschatz [1996], Harinarayan, Rajaraman, and Ullman [1996], Kotidis and Roussopoulos [1999], Nadeau and Teorey [2002, 2003], and Thomsen [1997]. Data mining principles and tools can be found in Han and Kamber [2001], Makridakis, Wheelwright, and Hyndman [1998], Mitchell [1997], The University of Waikato [2005], and Witten and Frank [2000], among many others.


9 CASE Tools for Logical Database Design

Database design is just one part of the analysis and design phase of creating effective business application software (see Figure 9.1), but it is often the part that is the most challenging and the most critical to performance. In the previous chapters, we explored the classic ways of creating efficient and effective database designs, including ER modeling and the transformation of the ER models into constructs by transformation rules. We also examined normalization, normal forms and denormalization, and specific topologies used in warehousing, such as star schema. All this information may leave your head spinning!

This chapter focuses on commercially available tools to simplify these design processes. These computer-aided system engineering, or CASE, tools provide functions that assist in system design. CASE tools are widely used in numerous industries and domains, such as circuit design, manufacturing, and architecture. Logical database design is another area where CASE tools have proven effective. This chapter explores the offerings of the major vendors in this space: IBM, Computer Associates, and Sybase. Each of these companies offers powerful, feature-rich technology for developing logical database designs and transitioning them into physical databases you can use.

Although it is impossible to present information on software products without some subjectivity and comment, we have sincerely attempted to discuss the capabilities of these products with minimal product bias or critique. Also, it is impossible to describe the features of these products in great detail in a chapter of this sort (a user manual of many hundred pages could be written describing each), so we have set the bar slightly lower, with the aim of surveying these products to give the reader a taste for the capabilities they provide. Further details can be obtained from the manufacturers’ Web sites, which are listed in the Literature Summary at the end of the chapter.

9.1 Introduction to the CASE Tools

In this chapter, we will introduce some of the most popular and powerful products available for helping with logical database design: IBM’s Rational Data Architect, Computer Associates’ AllFusion ERwin Data Modeler, and Sybase’s PowerDesigner. These CASE tools help the designer develop a well-designed database by walking through a process of conceptual design, logical design, and physical creation, as shown in Figure 9.2.

Computer Associates’ AllFusion ERwin Data Modeler has been around the longest. A stand-alone product, AllFusion ERwin’s strengths stem from relatively strong support of physical data modeling, the broadest set of technology partners, and third-party training. What it does it does well, but in recent years it has lagged in some advanced features. Sybase’s PowerDesigner has come on strong in the past few years, challenging AllFusion ERwin. It has some advantages in reporting and some advanced features that will be described later in this chapter. IBM’s Rational Data Architect is a new product that supplants IBM’s previous product, Rational Rose Data Modeler. Its strength lies in strong design checking; rich integration with IBM’s broad software development platform, including products from their Rational, Information Management, and Tivoli divisions; and advanced features that will be described below.

Figure 9.1 Business system life cycle (courtesy IBM Corp.)

In previous chapters, we discussed the aspects of logical database design that CASE tools help design, annotate, apply, and modify. These include, for example, ER and UML modeling, and how this modeling can be used to develop a logical database design. Within the ER relationship design, there are several types of entity definitions and relationship modeling (unrelated, one-to-many, and many-to-many). These relationships are combined and denormalized into schema patterns known as normal forms (e.g., 3NF, star schema, snowflake schema). An effective design requires the clear definition of keys, such as the primary key, the foreign key, and unique keys within relationships. The addition of constraints to limit the usage (and abuses) of the system within reasonable bounds or business rules is also critical. The effective logical design of the database will have a profound impact on the performance of the system, as well as the ease with which the database system can be maintained and extended.

Figure 9.2 Database design process

There are several other CASE products that we will not discuss in this book. A few additional products worth investigating include Datanamic’s DeZign for Databases, QDesigner by Quest Software, Visible Analyst by Standard, and Embarcadero ER/Studio. The Visual Studio .NET Enterprise Architect edition includes a version of Visio with some database design stencils that can be used to create ER models. The cost and function of these tools vary wildly, from open source products up through enterprise software that costs thousands of dollars per license.

The full development cycle includes an iterative cycle of understanding business requirements; defining product requirements; analysis and design; implementation; testing (component, integration, and system); deployment; administration and optimization; and change management. No single product currently covers that entire scope. Instead, product vendors provide, to varying degrees, suites of products that focus on portions of that cycle. CASE tools for database design largely focus on the analysis and design, and to a lesser degree the testing, of the database model and its creation, as illustrated in Figure 9.2.

CASE tools provide software that simplifies or automates some of the steps described in Figure 9.2. Conceptual design includes steps such as describing the business entities and functional requirements of the database; logical design includes definition of entity relationships and normal forms; physical database design helps transform the logical design into actual database objects, such as tables, indexes, and constraints. The software tools provide significant value to database designers by:

1. Dramatically reducing the complexity of conceptual and logical design, both of which can be rather difficult to do well. This reduced complexity results in better database design in less time and with lower skill requirements for the user.

2. Automating transformation of the logical design to the physical design (at least the basic physical design). This not only reduces time and skill requirements for the designer, but significantly removes the chance of manual error in performing the conversion from the logical model to the physical data definition language (DDL), which the database server will “consume” (i.e., as input) to create the physical database.

3. Providing the reporting, round trip engineering, and reverse engineering that make such tools invaluable in maintaining systems over a long period of time. System design can and does evolve over time due to changing and expanding business needs. Also, the people who design the system (sometimes teams of people) may not be the same as those charged with maintaining the system. The complexity of large systems, combined with the need for continuous adaptability, virtually necessitates the use of CASE tools to help visualize, reverse engineer, and track the system design over time.

You can find a broader list of available database design tools at the Web site “Database Answers” (http://www.databaseanswers.com/modelling_tools.htm), maintained by David Alex Lamb at Queen’s University in Kingston, Canada.

9.2 Key Capabilities to Watch For

Design tools should be able to help you with both data modeling and logical database design. Both processes are important. A good distinction between these appears on the “Database Answers” Web site, cited above.

For data modeling, the question you are asking is: What does the world being modeled look like? In particular, you are looking for similarities between things. Then you identify a “supertype” of thing which may have subtypes. For example, Corporate Customers and Personal Customers. If, for example, supplier contacts are conceptually different things from customer contacts, then the answer is that they should be modeled separately. On the other hand, if they are merely subsets of the same thing, then treat them as the same thing.

For database design, you are answering a different question: How can I efficiently design a database that will support the functions of a proposed application or Web site? The key task here is to identify similarities between entities so that you can integrate them into the same table, usually with a “Type” indicator. For example, a Customer table could combine all attributes of both Corporate and Personal Customers, as in the sketch below. As a result, it is possible to spend a great deal of time breaking things out when creating a data model, and then collapsing them back together when designing the corresponding database.
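As a minimal sketch of the collapsed design (all table and column names here are illustrative, not taken from the book’s examples):

create table customer
    (cust_no char(6),
     cust_type char(1),        -- 'C' = corporate, 'P' = personal
     cust_name varchar(64),
     contact_name varchar(64), -- used by corporate customers; null otherwise
     primary key (cust_no));

The cust_type indicator lets one table serve both subtypes, at the cost of columns that are null for rows of the other subtype.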

Support for programmable and physical design attributes with a design tool can also expand the value a tool provides. In database terms, aspects to watch for include support for indexes, uniqueness, triggers, and stored procedures.


The low-end tools (selling for less than US$100 or available as open source) provide the most basic functionality for ER modeling. The higher-end products provide the kinds of support needed for serious project design, such as:

• Complete round trip engineering

• UML design

• Schema evolution; change management

• Reverse engineering of existing systems

• Team support, allowing multiple people to work on the same project concurrently

• Integration with Eclipse and .NET and other tooling products

• Component and convention reuse (being able to reuse naming standards, domains, and logical models over multiple design projects)

• Reusable assets (e.g., extensibility, template)

• Reporting

9.3 The Basics

All of the products in question provide strong, easy-to-use functions for both data modeling and database design. All of these products provide the ability to graphically represent ER relationships. These tools also provide transformation processes to map from an ER model into an SQL design (DDL), using the transformation types described earlier in Chapter 5:

• Transform each entity into a table containing the key and nonkey attributes of the entity

• Transform every many-to-many binary or binary recursive relationship into a relationship table with the keys of the entities and the attributes of the relationship

• Transform every ternary or higher-level n-ary relationship into a relationship table

Similarly, these tools produce the transformation table types described in Chapter 5:


• An entity table with the same information content as the original entity

• An entity table with the embedded foreign key of the parent entity

• A relationship table with the foreign keys of all the entities in the relationship

Chapter 5 also described rules for null transformations that must apply, and the CASE tools typically enforce these.

These CASE tools also help with the modeling of normal forms and denormalization to develop a true physical schema for your database, as described in Chapter 5. The tools provide graphical interfaces for physical database design as well as basic modeling of uniqueness, constraints, and indexes. Figure 9.3 shows an example of the IBM Rational Data Architect’s GUI for modeling ERs. Figure 9.4 shows a similar snapshot of the interface for Computer Associates’ AllFusion ERwin Data Modeler.

Figure 9.3 Rational Data Architect ER modeling (courtesy IBM Rational Division)


After creating an ER model, the CASE tools enable easy modification of the model and its attributes through graphical interfaces. An example is shown below in Figure 9.5 with IBM’s Rational Data Architect, illustrating attribute editing. Of these CASE tools, Rational Data Architect has perhaps the most useful UML modeling function for data modeling and design. Its predecessor, Rational Rose Data Modeler, was the industry’s first UML-based data modeler, and IBM has continued its leadership in this area with Rational Data Architect. UML provides a somewhat richer notation than information engineering (IE) entity-relationship diagram (ERD) notation, particularly for conceptual and logical data modeling. However, the IE-ERD notation is older and more commonly used. One of the nice aspects of Rational Data Architect is the ability to work with either UML or IE notation.

Figure 9.6 shows the AllFusion ERwin screen for defining the cardinality of entity relationships. It is worth noting that many relationships do not require entering this dialog at all.

Figure 9.4 AllFusion ERwin Data Modeler ER modeling (picture from Computer Associates, http://agendas.ca.com/Agendas/Presentations/AFM54PN.pdf)


Figure 9.5 Property editing with IBM Rational Data Architect (courtesy IBM Rational Division)

Figure 9.6 Specifying one-to-many relationships with ERwin (courtesy Computer Associates)


9.4 Generating a Database from a Design

To really take your design to the next level (i.e., a practical level), you will need a good design tool that can interact with your specific database product to actually create the Data Definition Language (DDL) and associated scripts or commands to create and modify the basic database for you. For instance, using the example of Chapter 7, we have modeled an ER model containing the sales relationships shown in Table 9.1.

The CASE tools will automatically generate the required scripts, including the DDL specification to create the actual database, and will provide you with an option to apply those changes to an actual database, as follows:

create table customer
    (cust_no char(6),
     job_title varchar(256),
     primary key (cust_no),
     foreign key (job_title) references job
         on delete set null on update cascade);

create table job
    (job_title varchar(256),
     primary key (job_title));

create table order
    (order_no char(9),
     cust_no char(6) not null,
     primary key (order_no),
     foreign key (cust_no) references customer
         on delete cascade on update cascade);

Table 9.1 ER Model Containing Sales Relationships

ER Construct                                       FDs
Customer(many): Job(one)                           cust-no -> job-title
Order(many): Customer(one)                         order-no -> cust-no
Salesperson(many): Department(one)                 sales-name -> dept-no
Item(many): Department(one)                        item-no -> dept-no
Order(many): Item(many): Salesperson(one)          order-no, item-no -> sales-name
Order(many): Department(many): Salesperson(one)    order-no, dept-no -> sales-name


create table salesperson
    (sales_name varchar(256),
     dept_no char(2),
     primary key (sales_name),
     foreign key (dept_no) references department
         on delete set null on update cascade);

create table department
    (dept_no char(2),
     primary key (dept_no));

create table item
    (item_no char(6),
     dept_no char(2),
     primary key (item_no),
     foreign key (dept_no) references department
         on delete set null on update cascade);

create table order_item_sales
    (order_no char(9),
     item_no char(6),
     sales_name varchar(256) not null,
     primary key (order_no, item_no),
     foreign key (order_no) references order
         on delete cascade on update cascade,
     foreign key (item_no) references item
         on delete cascade on update cascade,
     foreign key (sales_name) references salesperson
         on delete cascade on update cascade);

create table order_dept_sales
    (order_no char(9),
     dept_no char(2),
     sales_name varchar(256) not null,
     primary key (order_no, dept_no),
     foreign key (order_no) references order
         on delete cascade on update cascade,
     foreign key (dept_no) references department
         on delete cascade on update cascade,
     foreign key (sales_name) references salesperson
         on delete cascade on update cascade);

It is worth noting that with all the CASE tools we discuss here, the conversion of the logical design to the physical design is quite rudimentary. These tools help create basic database objects, such as tables and, in some cases, indexes. However, the advanced features of the database server are often not supported; where they are supported, the CASE tool is usually behind by two or three software releases. Developing advanced physical design features, such as multidimensional clustering or materialized views, is far beyond the capabilities of the logical design tools we are discussing. Advanced physical database design is often highly dependent on data density and data access patterns. One feature of Rational Data Architect that stands out is that it provides linkages with the autonomic computing (self-managing) capabilities within DB2 to provide semi-automated selection of advanced physical design attributes.

Figure 9.7 shows an example with ERwin schema generation, generating the DB2 DDL directly from the ER model designed within ERwin.

Other very important capabilities shared by these tools include the ability to reverse engineer existing databases (for which you may not have an existing ER or physical UML model), and the ability to automatically materialize the physical model, or incremental changes of a model, onto a real database. This capability enables you to synchronize your database design with a real database as you make changes. This capacity is massively useful for incremental application and database development, as well as for incremental maintenance.

Figure 9.7 ERwin schema generation for a DB2 database (picture from IBM: http://www.redbooks.ibm.com/abstracts/redp3714.html?Open)

9.5 Database Support

All of these products support a large array of database types. Certainly, all of the major database vendors are supported by each of these products (i.e., DB2 UDB, DB2 zOS, Informix IDS, Oracle, SQL Server), and a much larger set is supported through ODBC. However, what really matters most to the application developer is whether the database he or she is programming toward is directly supported by the CASE design tool. Database support is not equal between these products. Also, very significantly, each database product will have unique features of its own (such as reverse scan indexes, informational constraints, and so forth) which are not standard. One of the qualitative attributes of a good database design tool is whether it distinguishes and supports the unique extensions of individual database products. Each of the products has a strong history of doing so: in descending order, AllFusion ERwin Data Modeler, Sybase PowerDesigner, and Rational Data Architect. Notably, IBM’s Rational Data Architect has a somewhat smaller range of supported databases than the other products, though it does support all the major database platforms. However, Rational Data Architect can be expected over time to have the tightest integration with the DB2 and Informix families, since all of these products are developed by IBM. Database designers are advised to investigate the level of support provided by a CASE tool for the database being developed toward, to ensure that the level of support is adequate. Figure 9.8 shows an example of database server selection with AllFusion ERwin Data Modeler.

Figure 9.8 DBMS selection in AllFusion ERwin Data Modeler (picture from Computer Associates: http://iua.org.uk/conference/Autumn%202003/Ruth%20Wunderle.ppt#9)

9.6 Collaborative Support

All three of these products are designed for collaborative development, so that multiple developers can work together to design portions of a database design, either supporting different applications or collaborating on the same portions. These collaboration features fall into two domains:

1. Concurrency control. This form of collaboration ensures that multiple designers do not modify the same component of the database design at the same time. This is comparable in software development terms to a source code control system.

2. Merge and collaboration capabilities. This form of collaboration enables designers to combine designs or merge their latest changes into a larger design project. These merging capabilities compare components between what is already logged into the project system and what a designer wishes to add or modify. The CASE tools locate the conflicting changes and visually identify them for the designer, who can decide which changes should be kept and which discarded in favor of the model currently defined in the project.

Figure 9.9 Merge process with PowerDesigner (courtesy of Sybase)

Figure 9.9 shows the Sybase PowerDesigner merge GUI, which identifies significant changes between the existing schema and the new schema being merged. In particular, notice how the merge tool has identified a change in Table_1 Column_1, which has changed base types. The tool also found that Table_2 and Table_3, which exist in the merging design, were not present in the base design. AllFusion ERwin Data Modeler and Rational Data Architect have similar capabilities for merging design changes.

9.7 Distributed Development

Distributed development has become a fact of life for large enterprise development teams, in which groups of developers collaborate from geographically diverse locations to develop a project. The phenomenon is not only true across floors of a building, or between sites in a city, but now across states, provinces, and even countries. In fact, outsourcing of software development has become a tour de force, with many analysts projecting that the average enterprise will ultimately outsource 60% of application work, shifting aspects of project development to locations with cheaper labor. As the META Group said in its September 16, 2004 Offshore Market Milieu report, “With global resources costing one-third to one-fifth that of American employees—without accounting for hidden costs and having higher process discipline, offshore strategies now pervade North American IT organizations.”

Therefore, developers of database software working in a distributed collaborative environment need to consider the collaborative and distributed qualities of CASE tools for database design. The trend toward collaborative development and distributed development shows no sign of slowing; rather, it is on the increase. In the existing space, IBM’s Rational MultiSite software, shown in Figure 9.10, allows the best administration across geographically diverse locations for replicating project software and data and subsequently merging the results. Rational MultiSite is a technology layered on top of Rational ClearCase and Rational ClearQuest (development and source code control products) to allow local replicas of Rational ClearCase and Rational ClearQuest repositories. Rational MultiSite also handles the automatic synchronization of the replicas. This is particularly useful for companies with distributed development who wish to have fast response times for their developers via access to local information, and such replication is often an absolute requirement for geographically distributed teams.

9.8 Application Life Cycle Tooling Integration

The best CASE tools for database design are integrated with a complete suite of application development tools that cover the software development life cycle. This allows the entire development team to work from an integrated tool platform, rather than the data modelers being off in their own world. Only the largest vendors offer this, and in fact true tooling integration across the development life cycle is somewhat rare. This solution is, in a very real way, the philosopher’s stone of development infrastructure vendors. All the vendors who produce software development platforms have been working to develop this breadth during the past two decades. The challenge is elusive simply because it is hard to do well. The three companies we are discussing here all have broad offerings, and provide some degree of integration with their database design CASE technology.

Figure 9.10 IBM’s Rational MultiSite software for massively distributed software management (courtesy IBM Rational Division)

For Computer Associates, the AllFusion brand is a family of development life cycle tools. It is intended to cover designing, building, deploying, and managing eBusiness applications. Sybase also has a broad product suite, and their strength is in collaborative technology. From a design perspective, the ability to plug Sybase PowerDesigner into their popular Sybase PowerBuilder application development tooling is a very nice touch, as seen in Figure 9.11. The IBM tooling is clearly the broadest based, and their new IBM Software Development Platform, which is built heavily but not exclusively from their Rational products, covers everything from requirements building to portfolio management, source code control, architectural design constraints, automated testing, performance analysis, and cross-site development. A representation of the IBM Software Development Platform is shown in Figure 9.12.

Figure 9.11 Sybase PowerDesigner plug-in to Sybase PowerBuilder (picture from http://www.pbugg.de/docs/1, Berndt Hambock)


Figure 9.12 IBM Software Development Platform (courtesy IBM Rational Division)

[Figure 9.12 is a block diagram of the Eclipse-based platform: Eclipse Core, EMF, GEF, JDT/CDT, and team tooling at the base; model services (UML2 extensions, other meta-models, and code generation APIs); and role-based tools for analysts, architects, developers, testers, and deployment managers, including Rational Software Modeler, Rational Software Architect, Rational Application Developer, Rational Web Developer, Rational Data Architect, Rational Functional and Manual Tester, Rational Performance Tester, Tivoli Configuration Manager, Tivoli Monitoring, and WebSphere Business Integration Modeler & Monitor.]

9.9 Design Compliance Checking

With all complex designs, and particularly when multiple designers are involved, it can be very hard to maintain the integrity of the system design. The best software architects and designers grapple with this by defining design guidelines and rules. These are sometimes called “design patterns” and “anti-patterns.” A design pattern is a design principle that is expected to be generally adhered to within the system design. Conversely, an anti-pattern is precisely the opposite: it represents flaws in the system design that can occur either through violation of the design patterns or through explicit definition of an anti-pattern. The enforcement of design patterns and anti-patterns is an emerging attribute of the best CASE tools for systems design in general, and database design in particular. Figure 9.13 shows an example of the interface used in Rational Data Architect for compliance checking, which scans the system to enforce design patterns and check for anti-patterns. Some degree of support for design pattern and anti-pattern checking exists in AllFusion ERwin Data Modeler and Sybase PowerDesigner as well. The compliance checking in IBM’s Rational products is the most mature in general, with the notion of design patterns and anti-patterns being a key philosophical point for IBM’s Rational family of products. Some examples of things these compliance checkers will scan for include:

• Complete round trip engineering

• Design and normalization

— Discover 1st, 2nd, and 3rd normalization

• Index and storage

— Check for excessive indexing

• Naming standards

• Security compliance

• Sarbanes-Oxley compliance

— Check for valid data model and rules

• Model syntax checks

Figure 9.13 Modeling and database compliance checking (courtesy IBM Rational Division)


9.10 Reporting

Reporting capabilities are a very important augmentation of the design capabilities in CASE tools for database design. These reports allow you to examine your ER and UML modeling and database schema in both graphical and textual formats. The Sybase products have a superb reputation for reporting features; their products enable you to generate reports in common formats like Microsoft Word. Reporting can include both the modeling and the annotations that the designer has added. It can cover physical data models, conceptual data models, object-oriented models (UML), business process models, and semi-structured data using eXtensible Markup Language (XML). Notice in Figure 9.14 how the fourth page, which contains graphics, can be oriented in landscape mode, while the remaining pages are kept in portrait mode. AllFusion ERwin and Rational Data Architect also provide rich reporting features, though Sybase PowerDesigner has the richest capabilities.

Figure 9.14 Reporting features with Sybase PowerDesigner (courtesy Sybase)

9.11 Modeling a Data Warehouse

In Chapter 8 we discussed the unique design considerations required for data warehousing and decision support. Typically, warehouses are designed to support complex queries that provide analysis of your data. As such, they exploit different schema topology models, such as star schema and horizontal partitioning. They typically exploit data views and materialized data views, data aggregation, and multidimensional modeling far more extensively than other operational and transactional databases.

Traditionally, warehouses have been populated with data that is extracted and transformed from other operational databases. However, more and more companies are moving to consolidate system resources and provide real-time analytics by either feeding warehouses data in near-real-time (i.e., with a few minutes latency) or entirely merging their transactional data stores with their analytic warehouses into a single server or cluster. These trends are known as “active data warehousing,” and pose even more complex design challenges. There is a vast need for CASE tooling in this space.

Sybase offers a CASE tool known as Sybase Industry Warehouse Studio (IWS). Sybase IWS is really a set of industry-specific, prepackaged warehouses that require some limited customization. Sybase IWS tooling provides a set of wizards for designing star schemas, dimensional tables, denormalization, summarization, and partitioning; as usual, the Sybase tools are strong on reporting facilities.

The industry domains covered by IWS are fairly reasonable—they include IWS for Media, IWS for Healthcare, IWS for Banking, IWS for Capital Markets, IWS for Life Insurance, IWS for Telco, IWS for Credit Cards, IWS for P&C Insurance, and IWS for CRA.

IBM’s DB2 Cube Views (shown in Figure 9.15) provides OLAP and multidimensional modeling. DB2 Cube Views allows you to create metadata objects to dimensionally model OLAP structures and relational data. The graphical interface allows you to create, manipulate, import, or export cube models, cubes, and other metadata objects.


Sybase IWS uses standard database design constructs that port to many database systems, such as DB2 UDB, Oracle, Microsoft SQL Server, Sybase Adaptive Server Enterprise, and Sybase IQ. In contrast, IBM’s DB2 Cube Views is designed specifically to exploit DB2 UDB. The advantage of DB2 Cube Views is that it can exploit product-specific capabilities in the DB2 database that may not be generally available in other databases. Some examples of this include materialized query tables (precomputed aggregates and cubes), multidimensional clustering, triggers, functional dependencies, shared-nothing partitioning, and replicated MQTs. Sybase IWS’s dependence on lowest-common-denominator database features provides flexibility when selecting the database server, but may prove extremely limiting for even moderately sized marts and warehouses (i.e., larger than 100 GB), where advanced access and design features become critical.

To summarize and contrast, Sybase offers portable warehouse designs that require minimal customization and are useful for smaller systems, while DB2 Cube Views provides significantly richer and more powerful capabilities, which fit larger systems, require more customization, and necessitate DB2 UDB as the database server.

Figure 9.15 DB2 Cube Views interface (courtesy IBM Rational Division)


AllFusion ERwin Data Modeler has basic support to model OLAP and multidimensional databases, but it does not have the same richness of tooling and wizards that the other companies offer to substantially simplify the design process of these complex systems.

9.12 Semi-Structured Data, XML

XML (eXtensible Markup Language) is a data model consisting of nodes of several types linked together with ordered parent/child relationships to form a hierarchy. One representation of that data model is textual—there are others that are not text! XML has increasingly become a data format of choice for data sharing between systems. As a result, increasing volumes of XML data are being generated.

While XML data has some structure, it is not a fully structured format, such as the table definitions that come from fully structured modeling using ER with IE or UML. XML is known in the industry as a semi-structured format: it lacks the strict adherence to schema that structured data schemas have, yet it has some degree of structure, which distinguishes it from completely unstructured data, such as image and video data.

Standards are forming around XML to allow it to be used for database-style design and query access. The dominant standards are XML Schema and XML Query (also known as XQuery). Also worth noting is the OMG XMI standard, which defines a standard protocol for defining a structured format for XML interchange, based on an object model. Primarily for interfacing reasons, UML tools such as MagicDraw have taken XMI seriously and have therefore become the preferred alternatives in the open source space.

XML data is text based and self-describing (meaning that XML describes the type of each data point and defines its own schema). XML has become popular for Internet-based data exchange based on these qualities, as well as being “well-formed.” Well-formed is a computer science term, implying XML’s grammar is unambiguous through the use of mandated structure that guarantees terms are explicitly prefixed and closed. Figure 9.16 shows the conceptual design of a semi-structured document type named “recipe.” Figure 9.17 shows an XML document for a hot dog recipe. Notice that the file is completely textual.

Figure 9.16 An XML schema for a recipe

IBM Rational Data Architect and Sybase PowerDesigner have taken the lead in being early adopters of XML data modeling CASE tools. Both products support the modeling of semi-structured data through XML and provide graphical tooling for modeling XML hierarchies.

Figure 9.17 An XML document for a hot dog


9.13 Summary

There are several good CASE tools available for computer-assisted database design. This chapter has touched on some of the features of three of the leading products: IBM Rational Data Architect, Computer Associates AllFusion ERwin Data Modeler, and Sybase PowerDesigner. Each provides powerful capabilities to assist in developing ER models and transforming those models to logical database designs and physical implementations. All of these products support a wide range of database vendors, including DB2 UDB, DB2 zOS, Informix Data Server (IDS), Oracle, SQL Server, and many others through ODBC support. Each product has different advantages and strengths. The drawbacks a product may have now are certain to be improved over time, so discussing the relative merits of each product in a book can be somewhat of an injustice to a product that will deliver improved capabilities in the near future.

At the time of authoring this text, Computer Associates’ AllFusion ERwin Data Modeler had advantages as a mature product with vast database support. The AllFusion products don’t have the advanced complex feature support for XML and warehousing/analytics, but what they do support they do well. Sybase PowerDesigner sets itself apart with superior reporting capabilities. IBM’s Rational Data Architect has the best integration with a broad software application development suite of tooling, and the most mature use of UML. Both the Sybase and IBM tools are blazing new ground in their early support for XML semi-structured data and for CASE tools for warehousing and OLAP. The best products provide the highest level of integration into a larger software development environment for large-scale collaborative, and possibly geographically diverse, development. These CASE tools can dramatically reduce the time, cost, and complexity of developing, deploying, and maintaining a database design.

9.14 Literature Summary

Current logical database design tools can be found in manufacturer Web sites [Database Answers, IBM Rational Software, Computer Associates, Sybase PowerDesigner, Directory of Data Modeling Resources, Objects by Design, Understanding relational databases: referential integrity, and Widom].


Appendix

The Basics of SQL

Structured Query Language, or SQL, is the ISO-ANSI standard data definition and manipulation language for relational database management systems. Individual relational database systems use slightly different dialects of SQL syntax and naming rules, and these differences can be seen in the SQL user guides for those systems. In this text, as we explore each step of the logical and physical design portion of the database life cycle, many examples of database table creation and manipulation make use of SQL syntax.

Basic SQL use can be learned quickly and easily by reading this appendix. The more advanced features, such as statistical analysis and presentation of data, require additional study and are beyond the reach of the typical nonprogrammer. However, the DBA can create SQL views to help nonprogrammers set up repetitive queries; other languages, such as forms, are being commercially sold for nonprogrammers. For the advanced database programmer, embedded SQL (in C programs, for instance) is widely available for the most complex database applications, which need the power of procedural languages.

This appendix introduces the reader to the basic constructs for the SQL-99 (and SQL-92) database definition, queries, and updates through a sequence of examples with some explanatory text. We start with a definition of SQL terminology for data types and operators. This is followed by an explanation of the data definition language (DDL) constructs using the “create table” commands, including a definition of the various types of integrity constraints, such as foreign keys and referential integrity. Finally, we take a detailed look at the SQL-99 data manipulation language (DML) features through a series of simple and then more complex practical examples of database queries and updates.

The specific features of SQL, as implemented by the major vendors IBM, Oracle, and Microsoft, can be found in the references at the end of this appendix.

A.1 SQL Names and Operators

This section gives the basic rules for SQL-99 (and SQL-92) data types and operators.

• SQL names: Although these have no particular restrictions, some vendor-specific versions of SQL do have some restrictions. For example, in Oracle, names of tables and columns (attributes) can be up to 30 characters long, must begin with a letter, and can include the symbols (a-z, 0-9, _, $, #). Names should not duplicate reserved words or names for other objects (attributes, tables, views, indexes) in the database.

• Data types for attributes: character, character varying, numeric, decimal, integer, smallint, float, double precision, real, bit, bit varying, date, time, timestamp, interval.

• Logical operators: and, or, not, ().

• Comparison operators: =, <>, <, <=, >, >=, (), in, any, some, all, between, not between, is null, is not null, like.

• Set operators:

— union: combines queries to display any row in each subquery

— intersect: combines queries to display distinct rows common to all subqueries

— except: combines queries to return all distinct rows returned by the first query, but not the second (this is “minus” or “difference” in some versions of SQL)

• Set functions: count, sum, min, max, avg.

• Advanced value expressions: CASE, CAST, row value constructors. CASE is similar to CASE expressions in programming languages, in which a select command needs to produce different results when there are different values of the search condition. The CAST expression allows you to convert data of one type to a different type, subject to some restrictions. Row value constructors allow you to set up multiple column value comparisons with a much simpler expression than is normally required in SQL (see Melton and Simon [1993] for detailed examples).
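For instance, a query against the item table defined in Section A.2 below might band prices with CASE and convert a value’s type with CAST (the band names and the cutoff of 100 are made up for illustration):

select item_name,
       case when price >= 100 then 'premium'
            else 'standard'
       end as price_band,
       cast(weight as integer) as weight_int
from item;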

A.2 Data Definition Language (DDL)

The basic definitions for SQL objects (tables and views) are:

• create table: defines a table and all its attributes

• alter table: adds new columns, drops columns, or modifies existing columns in a table

• drop table: deletes an existing table

• create view, drop view: defines/deletes a database view (see Section A.3.4)

Some versions of SQL also have create index/drop index, which defines/deletes an index on a particular attribute or composite of several attributes in a particular table.
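For example, building on the customer table defined below, a column can be added with alter table, and an index created on a frequently searched attribute. The create index form shown here is typical but, as noted, not part of the SQL standard, so check your system’s dialect; the column and index names are illustrative:

alter table customer add column phone char(12);

create index cust_name_index on customer (cust_name);

drop index cust_name_index;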

The following table creation examples are based on a simple database of three tables: customer, item, and order. (Note that we put table names in boldface throughout the book for readability.)

create table customer
    (cust_num numeric,
     cust_name char(20),
     address varchar(256),
     credit_level numeric,
     check (credit_level >= 1000),
     primary key (cust_num));

Note that the attribute cust_num could be defined as “numeric not null unique” instead of explicitly defined as the primary key, since they both have the same meaning. However, it would be redundant to have both forms in the same table definition. The check rule is an important integrity constraint that tells SQL to automatically test each insertion of a credit_level value for something greater than or equal to 1000. If not, an error message should be displayed.


create table item
    (item_num numeric,
     item_name char(20),
     price numeric,
     weight numeric,
     primary key (item_num));

create table order
    (ord_num char(15),
     cust_num numeric not null,
     item_num numeric not null,
     quantity numeric,
     total_cost numeric,
     primary key (ord_num),
     foreign key (cust_num) references customer
         on delete no action on update cascade,
     foreign key (item_num) references item
         on delete no action on update cascade);

SQL, while allowing the above format for primary key and foreign key, recommends a more detailed format, shown below, for table order:

constraint pk_constr primary key (ord_num),
constraint fk_constr1 foreign key (cust_num)
    references customer (cust_num)
    on delete no action on update cascade,
constraint fk_constr2 foreign key (item_num)
    references item (item_num)
    on delete no action on update cascade);

in which pk_constr is a primary key constraint name, and fk_constr1 and fk_constr2 are foreign key constraint names. The word “constraint” is a keyword, and the object in parentheses after the table name is the name of the primary key in that table referenced by the foreign key.

The following constraints are common for attributes defined in the SQL create table commands:

• Not null: a constraint that specifies that an attribute must have a nonnull value.

• Unique: a constraint that specifies that the attribute is a candidate key; that is, that it has a unique value for every row in the table. Every attribute that is a candidate key must also have the constraint not null. The constraint unique is also used as a clause to designate composite candidate keys that are not the primary key. This is particularly useful when transforming ternary relationships to SQL (see the sketch following this list).

• Primary key: the primary key is a set of one or more attributes that, when taken collectively, enables us to uniquely identify an entity or a row in a table. The set of attributes should not be reducible (see Section 6.1.2). The designation primary key for an attribute implies that the attribute must be not null and unique, but the SQL keywords NOT NULL and UNIQUE are redundant for any attribute that is part of a primary key, and need not be specified in the create table command.

• Foreign key: the referential integrity constraint specifies that a foreign key in a referencing table column must match an existing primary key in the referenced table. The references clause specifies the name of the referenced table. An attribute may be both a primary key and a foreign key, particularly in relationship tables formed from many-to-many binary relationships or from n-ary relationships.
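As a sketch of a composite candidate key, a hypothetical table for a ternary relationship (the table and column names here are invented for illustration) might read:

create table skill_assignment
    (emp_id numeric not null,
     project_id numeric not null,
     skill_type char(15) not null,
     primary key (emp_id, project_id),
     unique (project_id, skill_type));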

Foreign key constraints specify the actions taken when a row of the referenced table is deleted, and when the primary key of the referenced table is updated. The referential trigger actions for delete and update are similar:

• on delete cascade: the delete operation on the referenced table “cascades” to all matching foreign keys.

• on delete set null: foreign keys are set to null when they match the primary key of a deleted row in the referenced table. Each foreign key must be able to accept null values for this operation to apply.

• on delete set default: foreign keys are set to a default value when they match the primary key of the deleted row(s) in the referenced table. Legal default values include a literal value, “user,” “system user,” or “no action.”

• on update cascade: the update operation on the primary key(s) in the referenced table “cascades” to all matching foreign keys.

• on update set null: foreign keys are set to null when they match the old primary key value of an updated row in the referenced table. Each foreign key must be able to accept null values for this operation to apply.

• on update set default: foreign keys are set to a default value when they match the primary key of an updated row in the referenced table. Legal default values include a literal value, “user,” “system user,” or “no action.” (A small sketch combining these actions follows this list.)
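For instance, a hypothetical archive table (invented here for illustration; note that its item_num must accept nulls for set null to apply) could combine two of these actions:

create table order_archive
    (ord_num char(15),
     item_num numeric,
     primary key (ord_num),
     foreign key (item_num) references item (item_num)
         on delete set null
         on update cascade);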

The cascade option is generally applicable when either the mandatory existence constraint or the ID dependency constraint is specified in the ER diagram for the referenced table, and either set null or set default is applicable when optional existence is specified in the ER diagram for the referenced table (see Chapters 2 and 5).

Some systems, such as DB2, have an additional option on delete or update, called restrict. Delete restrict means that the referenced table rows are deleted only if there are no matching foreign key values in the referencing table. Similarly, update restrict means that the referenced table rows (primary keys) are updated only if there are no matching foreign key values in the referencing table.

Various column and table constraints can be specified as deferrable (the default is not deferrable), which means that the DBMS will defer checking the constraint until you commit the transaction. Often this is required for mutual constraint checking.
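A sketch of the standard syntax, reusing the constraint name from the earlier order definition (product support for deferrable constraints varies):

constraint fk_constr1 foreign key (cust_num)
    references customer (cust_num)
    on delete no action on update cascade
    deferrable initially immediate

-- later, inside a transaction:
set constraints fk_constr1 deferred;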

The following examples illustrate the alter table and drop table commands. The first alter table command modifies the cust_name data type from char(20) in the original definition to varchar(256). The second and third alter table commands add and drop a column, respectively. The add column option specifies the data type of the new column.

alter table customer modify (cust_name varchar(256));

alter table customer
    add column cust_credit_limit numeric;

alter table customer
    drop column credit_level;

drop table customer;
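Note that the modify form shown above is vendor syntax (Oracle, for example); the SQL standard expresses the same change with an alter column clause, roughly:

alter table customer
    alter column cust_name set data type varchar(256);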

A.3 Data Manipulation Language (DML)

Data manipulation language commands are used for queries, updates, and the definition of views. These concepts are presented through a series of annotated examples, from simple to moderately complex.


A.3.1 SQL Select Command

The SQL select command is the basis for all database queries. The following series of examples illustrates the syntax and semantics of the select command for the most frequent types of queries in everyday business applications. To illustrate each of the commands, we assume the following set of data in the database tables:

customer table

cust_num   cust_name   address          credit_level
001        Kirk        Enterprise       10
002        Spock       Enterprise        9
003        Scotty      Enterprise        8
004        Bones       Enterprise        8
005        Gorn        PlanetoidArena    1
006        Khan        CetiAlphaFive     2
007        Uhura       Enterprise        7
008        Chekov      Enterprise        6
009        Sulu        Enterprise        6

item table

item_num   item_name       price   weight
125        phaser            350        2
137        beam             1500      250
143        shield           4500     3000
175        fusionMissile    2750      500
211        captainsLog        50        2
234        starShip        25000    15000
356        sensor            245       15
368        intercom         1200       75
399        medicalKit         75        3

order table

ord_num   cust_num   item_num   quantity   total_cost
10012     005        125         2              700
10023     006        175        20            55000
10042     003        137         3             4500
10058     001        211         1               50
10232     007        368         1             1200
10266     002        356        50            12250
10371     004        399        10              750
11070     009        143         1             4500
11593     008        125         2              700
11775     006        125         3             1050
12001     001        234         1            25000


Basic Commands

1. Display the entire customer table. The asterisk (*) denotes that all records from this table are to be read and displayed.

select * from customer;

This results in a display of the complete customer table (as shown above).

2. Display customer name, customer number, and credit level for all customers on the Enterprise who have a credit level greater than 7. Order by ascending sequence of customer name (the order-by options are asc, desc). Note that the first selection condition is specified in the where clause and succeeding selection conditions are specified by and clauses. Character type data and other nonnumeric data are placed inside single quotes, but numeric data is given without quotes. Note that useful column names can be created by using formatting commands (which are not shown here).

select cust_name, cust_num, credit_level
from customer
where address = 'Enterprise'
and credit_level > 7
order by cust_name asc;

customer name   customer number   credit level
----------------------------------------------
Bones           004               8
Kirk            001               10
Scotty          003               8
Spock           002               9

3. Display all customer and order item information (all columns), but omit customers with a credit level greater than 6. In this query, the from clause shows the definition of abbreviations c and o for tables customer and order, respectively. The abbreviations can be used anywhere in the query to denote their respective table names. This example also illustrates a join between tables customer and order, using the common attribute name cust_num as shown in the where clause. The join finds matching cust_num values from the two tables and displays all the data from the matching rows, except where the credit level is 7 or above, and orders the result by customer number.

select c.*, o.*
from customer as c, order as o
where c.cust_num = o.cust_num
and c.credit_level < 7
order by c.cust_num asc;

cust. no.   cust. name   address          credit level   order no.   item no.   qty   total cost
-------------------------------------------------------------------------------------------------
005         Gorn         PlanetoidArena   1              10012       125         2          700
006         Khan         CetiAlphaFive    2              11775       125         3         1050
006         Khan         CetiAlphaFive    2              10023       175        20        55000
008         Chekov       Enterprise       6              11593       125         2          700
009         Sulu         Enterprise       6              11070       143         1         4500

Union and Intersection Commands

1. Which items were ordered by customer 002 or customer 007? This query can be answered in two ways, one with a set operator (union) and the other with a logical operator (or). In both forms, joins to the item and customer tables supply the item and customer names, since the order table itself carries only the numbers.

select o.item_num, i.item_name, o.cust_num, c.cust_name
from order as o, item as i, customer as c
where o.item_num = i.item_num and o.cust_num = c.cust_num
    and o.cust_num = 002
union
select o.item_num, i.item_name, o.cust_num, c.cust_name
from order as o, item as i, customer as c
where o.item_num = i.item_num and o.cust_num = c.cust_num
    and o.cust_num = 007;

select o.item_num, i.item_name, o.cust_num, c.cust_name
from order as o, item as i, customer as c
where o.item_num = i.item_num and o.cust_num = c.cust_num
    and (o.cust_num = 002 or o.cust_num = 007);

item number   item name   customer no.   customer name
-------------------------------------------------------
356           sensor      002            Spock
368           intercom    007            Uhura

2. Which items are ordered by both customers 005 and 006? All the rows in table order that have customer 005 are selected and compared to the rows in order that have customer 006. Rows from each set are compared with all rows from the other set, and those that have matching item numbers have the item numbers (and, via a join to item, the item names) displayed.

select o.item_num, i.item_name
from order as o, item as i
where o.item_num = i.item_num and o.cust_num = 005
intersect
select o.item_num, i.item_name
from order as o, item as i
where o.item_num = i.item_num and o.cust_num = 006;

item number   item name
------------------------
125           phaser

Aggregate Functions

1. Display the total number of orders. This query uses the SQL function count to count the number of rows in table order.

select count(*) from order;

count(order)
------------
11

2. Display the total number of customers actually placing orders for items. This is a variation of the count function and specifies that only the distinct number of customers is to be counted. The distinct modifier is required because duplicate values of customer numbers are likely to be found, since a customer can order many items and will appear in many rows of table order.

select count(distinct cust_num) from order;

distinct count(order)
---------------------
9


3. Display the maximum quantity of an order of item number 125. The SQL maximum function is used to search the table order, select rows where the item number is 125, and display the maximum value of quantity from the rows selected.

select max(quantity)
from order
where item_num = 125;

max(quantity)
-------------
3

4. For each type of item ordered, display the item number and total order quantity. Note that item_num and item_name in the select line must also appear in the group by clause. In SQL, any attribute to be displayed in the result of the select command must be included in a group by clause when the result of an SQL function is also to be displayed. The group by clause results in a display of the aggregate sum of quantity values for each value of item_num and item_name; the aggregate sums are taken over all rows with the same value of item_num. (As in the earlier examples, item_name is obtained by a join to the item table.)

select o.item_num, i.item_name, sum(o.quantity)
from order as o, item as i
where o.item_num = i.item_num
group by o.item_num, i.item_name;

item number   item name       sum(quantity)
-------------------------------------------
125           phaser          7
137           beam            3
143           shield          1
175           fusionMissile   20
211           captainsLog     1
234           starShip        1
356           sensor          50
368           intercom        1
399           medicalKit      10

5. Display item numbers for all items ordered more than once. This query requires the use of the group by and having clauses to display data that is based on a count of rows from table order having the same value for attribute item_num.


select o.item_num, i.item_name
from order as o, item as i
where o.item_num = i.item_num
group by o.item_num, i.item_name
having count(*) > 1;

item number   item name
------------------------
125           phaser

Joins and Subqueries

1. Display customer names of the customers who order item number 125. This query requires a join (equijoin) of tables customer and order to match customer names with item number 125. Including the item_num column in the output verifies that you have selected the item number you want. Note that the default ordering of output is typically ascending by the first column.

select c.cust_name, o.item_num
from customer as c, order as o
where c.cust_num = o.cust_num
and o.item_num = 125;

customer name   item number
----------------------------
Chekov          125
Gorn            125
Khan            125

This query can be equivalently performed with a subquery (sometimes called a nested subquery) with the following format. The select command inside the parentheses is a nested subquery and is executed first, resulting in a set of values for customer number (cust_num) selected from the order table. Each of those values is compared with cust_num values from the customer table, and matching values result in the display of customer name from the matching row in the customer table. This is effectively a join between tables customer and order with the selection condition of item number 125.


select cust_name
from customer
where cust_num in
    (select cust_num
     from order
     where item_num = 125);

2. Display names of customers who order at least one item priced over 1000. This query requires a three-level nested subquery format. Note that the phrases in, = some, and = any in the where clauses are often used as equivalent comparison operators; see Melton and Simon [1993].

select c.cust_name
from customer as c
where c.cust_num in
    (select o.cust_num
     from order as o
     where o.item_num = any
        (select i.item_num
         from item as i
         where i.price > 1000));

customer name
-------------
Khan
Kirk
Scotty
Sulu
Uhura

3. Which customers have not ordered any item priced over 100? Note that not in is equivalent here to <> all. The query first selects the customer numbers from all rows of the join of tables order and item where the item price is over 100. Then it selects rows from table customer where the customer number does not match any of the customers selected in the subquery, and displays the customer names.

select c.cust_name
from customer as c
where c.cust_num not in
    (select o.cust_num
     from order as o, item as i
     where o.item_num = i.item_num
     and i.price > 100);

customer name
-------------
Bones

4. Which customers have only ordered items weighing more than 1000? This is an example of the universal quantifier all. For each candidate customer, the correlated subquery collects the weights of all items that customer has ordered. The comparison 1000 < all is true only if every weight in that set exceeds 1000, so the customer name is displayed only when all of the customer’s ordered items weigh more than 1000.

select c.cust_name
from customer as c
where 1000 < all
    (select i.weight
     from order as o, item as i
     where o.item_num = i.item_num
     and o.cust_num = c.cust_num);

customer name
-------------
Sulu

Note that Kirk has ordered one item weighing over 1000 (starShip), but he has also ordered an item weighing under 1000 (captainsLog), so his name does not get displayed.
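An equivalent formulation, which some readers may find easier to verify, uses not exists to reject any customer who has ordered at least one item weighing 1000 or less; the exists clause excludes customers with no orders at all. This is a sketch, not a form used elsewhere in the text:

select c.cust_name
from customer as c
where exists
    (select *
     from order as o
     where o.cust_num = c.cust_num)
and not exists
    (select *
     from order as o, item as i
     where o.cust_num = c.cust_num
     and o.item_num = i.item_num
     and i.weight <= 1000);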

A.3.2 SQL Update Commands

The following SQL update commands relate to our continuing example and illustrate typical usage of insertion, deletion, and update of selected rows in tables.

This command adds one more customer (klingon) to the customer table:


insert into customer
values (010, 'klingon', 'rogueShip', 4);

This command deletes all customers with credit levels less than 2:

delete from customer
where credit_level < 2;

This command modifies the credit level of any customer with level 6 to level 7:

update customer
set credit_level = 7
where credit_level = 6;

A.3.3 Referential Integrity

The following update to the item table resets the value of item_num for a particular item, but because item_num is a foreign key in the order table, SQL must maintain referential integrity by triggering the execution sequence named by the foreign key constraint on update cascade in the definition of the order table (see Section A.2). This means that, in addition to updating a row in the item table, SQL will search the order table for values of item_num equal to 368 and reset each item_num value to 370.

update item
set item_num = 370
where item_num = 368;

If this update had been a delete instead, such as the following:

delete from item
where item_num = 368;

then the referential integrity trigger would have executed the on delete action of the foreign key constraint in order. As defined in Section A.2, that action is no action, so the delete would simply be rejected as long as matching rows remain in order. Had the constraint been defined with on delete set default instead, SQL would find every row in order with item_num equal to 368 and take the action set up in the default. A typical action for this type of database might be to set item_num to either null or a predefined literal value to denote that the particular item has been deleted; this would then be a signal to the system that the customer needs to be contacted to change the order. Of course, the system would have to be set up in advance to check for these values periodically.
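For the set default behavior just described, the foreign key would be declared along the following lines (a hypothetical variant: the table name order2 and the default value 0 are invented for illustration, and the default value must itself satisfy the constraint, for example by keeping a placeholder row in item with item_num 0):

create table order2
    (ord_num char(15) primary key,
     item_num numeric default 0,
     foreign key (item_num) references item (item_num)
         on delete set default
         on update cascade);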

A.3.4 SQL Views

A view in SQL is a named, derived (virtual) table that derives its data from base tables, the actual tables defined by the create table command. View definitions can be stored in the database, but the views (derived tables) themselves are not stored; they are derived at execution time when the view is invoked as a query using the SQL select command. The person who queries the view treats the view as if it were an actual (stored) table, unaware of the difference between the view and the base table.

Views are useful in several ways. First, they allow complex queries to be set up in advance in a view, and the novice SQL user is only required to make a simple query on the view. This simple query invokes the more complex query defined by the view. Thus, nonprogrammers are allowed to use the full power of SQL without having to create complex queries. Second, views provide greater security for a database, because the DBA can assign different views of the data to different users and control what any individual user sees in the database. Third, views provide a greater sense of data independence; that is, even though the base tables may be altered by adding, deleting, or modifying columns, the view query may not need to be changed. While the view definition may need to be changed, that is the job of the DBA, not the person querying the view.

Views may be defined hierarchically; that is, a view definition may contain another view name as well as base table names. This enables some views to become quite complex.

In the following example, we create a view called “orders” that shows which items have been ordered by each customer and how many. The first line of the view definition specifies the view name and (in parentheses) lists the attributes of that view. The view attributes must correlate exactly with the attributes defined in the select statement in the second line of the view definition:

create view orders (customer_name, item_name, quantity)
    as select c.cust_name, i.item_name, o.quantity
       from customer as c, item as i, order as o
       where c.cust_num = o.cust_num
       and o.item_num = i.item_num;


The create view command creates the view definition, which defines two joins among three base tables customer, item, and order; SQL stores the definition to be executed later when invoked by a query. The following query selects all the data from the view “orders.” This query causes SQL to execute the select command given in the preceding view definition, producing a tabular result with the column headings customer_name, item_name, and quantity.

select * from orders;
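To illustrate the hierarchical view definition mentioned above, a second view can be built on orders (the name big_orders and the quantity threshold here are invented for illustration):

create view big_orders (customer_name, item_name, quantity)
    as select customer_name, item_name, quantity
       from orders
       where quantity > 5;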

Usually, views are not allowed to be updated, because the updates would have to be made to the base tables that make up the definition of the view. When a view is created from a single table, the view update is usually unambiguous, but when a view is created from the joins of multiple tables, the base table updates are very often ambiguous and may have undesirable side effects. Each relational system has its own rules about when views can and cannot be updated.
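As a sketch of the single-table case, the following view (good_customers is an invented name) would typically be updatable; the optional with check option clause makes SQL reject any update that would move a row outside the view:

create view good_customers (cust_num, cust_name, credit_level)
    as select cust_num, cust_name, credit_level
       from customer
       where credit_level >= 8
       with check option;

update good_customers
set credit_level = 9
where cust_num = 003;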

A.4 References

Bulger, B., Greenspan, J., and Wall, D. MySQL/PHP Database Applications, 2nd ed., Wiley, 2004.

Gennick, J. Oracle SQL*Plus: The Definitive Guide, O’Reilly, 1999.

Gennick, J. SQL Pocket Guide, O’Reilly, 2004.

Melton, J., and Simon, A. R. Understanding The New SQL: A Complete Guide, Morgan Kaufmann, 1993.

Mullins, C. S. DB2 Developer’s Guide, 5th ed., Sams Publishing, 2004.

Nielsen, P. Microsoft SQL Server Bible, Wiley, 2003.

van der Lans, R. Introduction to SQL: Mastering the Relational Database Language, 3rd ed., Addison-Wesley, 2000.


Glossary

activity diagram (UML)—A process workflow model (diagram) showing the flow from one activity to the next.

aggregation—A special type of abstraction relationship that defines a higher-level entity that is an aggregate of several lower-level entities; a “part-of” type relationship. For example, a bicycle entity would be an aggregate of wheel, handlebar, and seat entities.

association—A relationship between classes (in UML); associations can be binary, n-ary, reflexive, or qualified.

attribute—A primitive data element that provides descriptive detail about an entity; a data field or data item in a record. For example, lastname would be an attribute for the entity customer. Attributes may also be used as descriptive elements for certain relationships among entities.

automatic summary table (AST)—Materialized (summary) views or aggregates of data saved by OLAP for future use to reduce query time.

binary recursive relationship—A relationship between one occurrence of an entity and another occurrence of the same entity.

binary relationship—A relationship between occurrences of two entities.

Boyce Codd normal form (BCNF)—A table is in Boyce Codd normal form if and only if for every functional dependency X->A, where X and A are either simple or composite attributes (data items), X must be a superkey in that table. This is a strong form of 3NF and is the basis for most practical normalization methodologies.

candidate key—Any subset of the attributes (data items) in a superkey that is also a superkey and is not reducible to another superkey.

CASE tool—Computer-aided software engineering tool, a software design tool to assist in the logical design of large or complex databases. Examples include ERwin Data Modeler and Rational Rose using UML.

class—A concept in a real-world system, represented by a noun in UML; similar to an entity in the ER model.

class diagram (UML)—A conceptual data model; a model of the static relationships between data elements of a system (similar to an ER diagram).

completeness constraint—Double-line symbol connecting a supertype entity with the subtypes to designate that the listed subtype entities represent a complete set of possible subtypes.

composition—A relationship between one class and a group of other classes in UML; the class at the diamond (aggregate) end of the relationship is composed of the class(es) at the small (component) end; similar to aggregation in the ER model.

conceptual data model—An organization of data that describes the relationships among the primitive data elements. For example, in the ER model, it is a diagram of the entities, their relationships, and their attributes.

connectivity of a relationship—A constraint on the count of the number of associated entity occurrences in a relationship, either one or many.

data item—The basic component of a data record in a file or database table; the smallest unit of information that has meaning in the real world. Examples include customer lastname, address, and identification number.

data model—An organization of data that describes the relationships among primitive and composite data elements.

data warehouse—A large repository of historical data that can be integrated for decision support.

database—A collection of interrelated stored data that serves the needs of multiple users; a collection of tables in the relational model.


database administrator (DBA)—The person in a software organization who is in charge of designing, creating, and maintaining the databases of an enterprise. The DBA makes use of a variety of software tools provided by a DBMS.

database life cycle—An enumeration and definition of the basic steps in the requirements analysis, design, creation, and maintenance of a database as it evolves over time.

database management system (DBMS)—A generalized software system for storing and manipulating databases. Examples include Oracle, IBM’s DB2, Microsoft SQL Server, and Access.

data mining—A way of extracting knowledge from a database by searching for correlations in the data and presenting promising hypotheses to the user for analysis and consideration.

DBA—See database administrator.

degree of a relationship—The number of entities associated in the relationship: recursive binary (1 entity), binary (2 entities), ternary (3 entities), n-ary (n entities).

denormalization—The consolidation of database tables to increase performance in data retrieval (query), despite the potential loss of data integrity. Decisions on when to denormalize tables are based on cost/benefit analysis by the DBA.

deployment diagram (UML)—Shows the physical nodes on which a system executes. This is more closely associated with physical database design.

dimension table—The smaller tables used in a data warehouse to denote the attributes of a particular dimension, such as time, location, customer characteristics, product characteristics, etc.

disjointness constraint (d)—A symbol in an ER diagram to designate that the lower-level entities in a generalization relationship have nonoverlapping (disjoint) occurrences. If the occurrences overlap, then use the designation (o) in the ER diagram.

entity—A data object that represents a person, place, thing, or event of informational interest; it corresponds to a record in a file when stored. For example, you could define employee, customer, project, team, and department as entities.

entity cluster—The result of a grouping operation on a collection of entities and relationships in an ER model to form a higher-level abstraction, which can be used to more easily keep track of the major components of a large-scale global schema.


entity instance (or occurrence)—A particular occurrence of an entity. For example, an instance of the entity actor would be Johnny Depp.

entity-relationship (ER) diagram—A diagram (or graph) of entities and their relationships, and possibly the attributes of those entities.

entity-relationship (ER) model—A conceptual data model involving entities, relationships among entities, and attributes of those entities.

exclusion constraint—A symbol (+) between two relationships in the ER model with a common entity that implies that either one relationship must hold at a given point in time, or the other must hold, but not both.

existence dependency—A dependency between two entities such that one is dependent upon the other for its existence, and cannot exist alone. For example, an employee work-history entity cannot exist without the corresponding employee entity. Also refers to the connectivity between two entities as being mandatory or optional.

fact table—The dominating table in a data warehouse and its star schema, containing dimension attributes and data measures at the individual data level.

fifth normal form (5NF)—A table is in fifth normal form (5NF) if and only if there are no possible lossless decompositions into any subset of tables; in other words, if there is no possible lossless decomposition, then the table is in 5NF (see Section 6.5).

file—A collection of records of the same type. For example, an employee file is a collection of employee records.

first normal form (1NF)—A table is in first normal form (1NF) if and only if there are no repeating columns of data taken from the same domain and having the same meaning.

foreign key—Any attribute in a SQL table (key or nonkey) that is taken from the same domain of values as the primary key in another SQL table and can be used to join the two tables (without loss of data integrity) as part of a SQL query.

fourth normal form (4NF)—A table is in fourth normal form (4NF) if and only if it is at least in BCNF and if whenever there exists a nontrivial multivalued dependency of the form X->>Y, then X must be a superkey in the table.


functional dependency (FD)—The property of one or more attributes (data items) that uniquely determines the value of one or more other attributes (data items). Given a table R, a set of attributes B is functionally dependent on another set of attributes A if, at each instant of time, each A value is associated with only one B value.

generalization—A special type of abstraction relationship that specifies that several types of entities with certain common attributes can be generalized (or abstractly defined) with a higher-level entity type, a supertype entity; an “is-a” type relationship. For example, employee is a generalization of engineer, manager, and administrative assistant, based on the common attribute job-title. A tool often used to make view integration possible.

global schema—A conceptual data model that shows all the data and their relationships in the context of an entire database.

key—A generic term for a set of one or more attributes (data items) that, taken collectively, enables one to identify uniquely an entity or a record in a SQL table; a superkey.

logical design—The steps in the database life cycle involved with the design of the conceptual data model (schema), schema integration, transformation to SQL tables, and table normalization; the design of a database in terms of how the data is related, but without regard to how it will be stored.

lossless decomposition—A decomposition of a SQL table into two or more smaller tables is lossless if and only if the cycle of table decomposition (normalization) and recomposition (joining the tables back through common attributes) can be done without loss of data integrity.

mandatory existence—A connectivity between two entities that has a lower bound of one. One example is the “works-in” relationship between an employee and a department; every department has at least one employee at any given time. Note: if this is not true, then the existence is optional.

multiplicity—In UML, the multiplicity of a class is an integer that indicates how many instances of that class are allowed to exist.

multivalued dependency (MVD)—The property of a pair of rows in a SQL table such that if the two rows contain duplicate values of attribute(s) X, then there is also a pair of rows obtained by interchanging the values of Y in the original pair. This is a multivalued dependency of X->>Y. For example, if two rows have the attribute values ABC and ADE, where X=A and Y has the values C and E, then the rows ABE and ADC must also exist for an MVD to occur. A trivial MVD occurs when Y is a subset of X or when X union Y is the entire set of attributes in the table.

normalization—The process of breaking up a table into smaller tables to eliminate problems with unwanted loss of data (the egregious side effects of losing data integrity) from the deletion of records, and inefficiencies associated with multiple data updates.

online analytical processing (OLAP)—A query service that overlays a data warehouse by creating and maintaining a set of summary views (automatic summary tables, or ASTs) to enable quick access to summary data.

optional existence—A connectivity between two entities that has a lower bound of zero. For example, for the “occupies” relationship between an employee and an office, there may exist some offices that are not currently occupied.

package—In UML, a package is a graphical mechanism used to organize classes into groups for better readability.

physical design—The step in the database life cycle involved with the physical structure of the data; that is, how it will be stored, retrieved, and updated efficiently. In particular, it is concerned with issues of table indexing and data clustering on secondary storage devices (disk).

primary key—A key that is selected from among the candidate keys for a SQL table to be used to create an index for that table.

qualified association—In UML, an association between classes may have constraints specified in the class diagram.

record—A group of data items treated as a unit by an application; a row in a database table.

referential integrity—A constraint in a SQL database that requires, for every foreign key instance that exists in a table, that the row (and thus the primary key instance) of the parent table associated with that foreign key instance must also exist in the database.

reflexive association—In UML, a reflexive association relates a class to itself.

relationship—A real-world association among one or more entities. For example, “purchased” could be a relationship between customer and product.


requirements specification—A formal document that defines the requirements for a database in terms of the data needed, the major users and their applications, the physical platform and software system, and any special constraints on performance, security, and data integrity.

row—A group of data items treated as a unit by an application; a record; a tuple in relational database terminology.

schema—A conceptual data model that shows all the relationships among the data elements under consideration in a given context; the collection of table definitions in a relational database.

second normal form (2NF)—A table is in second normal form (2NF) if and only if each nonkey attribute (data item) is fully dependent on the primary key; that is, either the left side of every functional dependency (FD) is a primary key or can be derived from a primary key.

star schema—The basic form of data organization for a data warehouse, consisting of a single large fact table and many smaller dimension tables.

subtype entity—The lower-level entity in a generalization relationship.

superkey—A set of one or more attributes (data items) that, taken collectively, allows the unique identification of an entity or a record in a relational table.

supertype entity—The higher-level abstract entity in a generalization relationship.

table—In a relational database, the collection of rows (or records) of a single type (similar to a file).

ternary relationship—A relationship that can only be defined among occurrences of three entities.

third normal form (3NF)—A table is in third normal form (3NF) if and only if for every functional dependency X->A, where X and A are either simple or composite attributes (data items), either X must be a superkey or A must be a member attribute of a candidate key in that table.

UML—Unified Modeling Language; a popular form of diagramming tools used to define data models and processing steps in a software application.


view integration—A step in the logical design part of the database life cycle that collects individual conceptual data models (views) into a single unified global schema. Techniques such as generalization are used to consolidate the individual data models.


References

Ambler, S. Agile Database Techniques: Effective Strategies for the Agile Software Developer, Wiley, 2003.

Bachman, C. W. “Data Structure Diagrams,” Database 1,2 (1969), pp. 4–10.

Bachman, C. W. “The Evolution of Storage Structures,” Comm. ACM 15,7 (July 1972), pp. 628–634.

Bachman, C. W. “The Role Concept in Data Models,” Proc. 3rd Intl. Conf. on Very Large Data Bases, Oct. 6, 1977, IEEE, New York, pp. 464–476.

Barquin, R., and Edelstein, H. (editors). Planning and Designing the Data Warehouse, Prentice-Hall, Upper Saddle River, NJ, 1997, Chapter 10 (OLAP).

Batini, C., Lenzerini, M., and Navathe, S. B. “A Comparative Analysis of Methodologies for Database Schema Integration,” ACM Computing Surveys 18,4 (Dec. 1986), pp. 323–364.

Batini, C., Ceri, S., and Navathe, S. Conceptual Database Design: An Entity-Relationship Approach, Benjamin/Cummings, 1992.


Beeri, C., Fagin, R., and Howard, J. H. “A Complete Axiomatization for Functional and Multivalued Dependencies in Database Relations,” Proc. 1977 ACM SIGMOD Int’l. Conf. on Management of Data, Toronto, 1977, pp. 47–61.

Beeri, C., Bernstein, P., and Goodman, N. “A Sophisticated Introduction to Database Normalization Theory,” Proc. 4th Intl. Conf. on Very Large Data Bases, Sept. 13–15, 1978, IEEE, New York, pp. 113–124.

Bernstein, P. “Synthesizing 3NF Tables from Functional Dependencies,” ACM Trans. Database Systems 1,4 (1976), pp. 272–298.

Briand, H., Habrias, H., Hue, J., and Simon, Y. “Expert System for Translating E-R Diagram into Databases,” Proc. 4th Intl. Conf. on Entity-Relationship Approach, IEEE Computer Society Press, 1985, pp. 199–206.

Brodie, M. L., Mylopoulos, J., and Schmidt, J. (editors). On Conceptual Modeling: Perspectives from Artificial Intelligence, Databases, and Programming Languages, Springer-Verlag, New York, 1984.

Bruce, T. A. Designing Quality Databases with IDEF1X Information Models, Dorset House, New York, 1992.

Bulger, B., Greenspan, J., and Wall, D. MySQL/PHP Database Applications, 2nd ed., Wiley, 2004.

Burleson, D. K. Physical Database Design Using Oracle, CRC Press, 2004.

Cardenas, A. F. “Analysis and Performance of Inverted Database Structures,” Communications of the ACM, vol. 18, no. 5, pp. 253–264, May 1975.

Cataldo, J. “Care and Feeding of the Data Warehouse,” Database Prog. & Design 10,12 (Dec. 1997), pp. 36–42.

Chaudhuri, S., and Dayal, U. “An Overview of Data Warehousing and OLAP Technology,” SIGMOD Record 26,1 (March 1997), pp. 65–74.

Chen, P. P. “The Entity-Relationship Model—Toward a Unified View of Data,” ACM Trans. Database Systems 1,1 (March 1976), pp. 9–36.

Chen and Associates, Inc. ER Designer (user manual), 1987.


Cobb, R. E., Fry, J. P., and Teorey, T. J. “The Database Designer’s Workbench,” Information Sciences 32,1 (Feb. 1984), pp. 33–45.

Codd, E. F. “A Relational Model for Large Shared Data Banks,” Comm. ACM 13,6 (June 1970), pp. 377–387.

Codd, E. F. “Recent Investigations into Relational Data Base Systems,” Proc. IFIP Congress, North-Holland, Amsterdam, 1974.

Codd, E. F. “Extending the Database Relational Model to Capture More Meaning,” ACM TODS 4,4 (Dec. 1979), pp. 397–434.

Codd, E. F. The Relational Model for Database Management, Version 2, Addison-Wesley, 1990.

Computer Associates, AllFusion ERwin Data Modeler, http://www3.ca.com/Solutions/Product.asp?ID=260

Database Answers, Modeling tools, http://www.databaseanswers.com/modelling_tools.htm

Date, C. J. An Introduction to Database Systems, Vol. 1, 7th ed., Addison-Wesley, 1999.

Directory of Data Modeling Resources, http://www.infogoal.com/dmc/dmcdmd.htm

Dittrich, K. R., Gotthard, W., and Lockemann, P. C. “Complex Entities for Engineering Applications,” Proc. 5th ER Conf., North-Holland, Amsterdam, 1986.

Dutka, A. F., and Hanson, H. H. Fundamentals of Data Normalization, Addison-Wesley, Reading, MA, 1989.

Elmasri, R., and Navathe, S. B. Fundamentals of Database Systems, 4th ed., Addison-Wesley, 2003.

Fagin, R. “Multivalued Dependencies and a New Normal Form for Relational Databases,” ACM Trans. Database Systems 2,3 (1977), pp. 262–278.

Fagin, R. “Normal Forms and Relational Database Operators,” Proc. 1979 ACM SIGMOD Conf. on Mgmt. of Data, pp. 153–160.

Faloutsos, C., Matias, Y., and Silberschatz, A. “Modeling Skewed Distributions Using Multifractal and the ‘80–20 Law’,” in Proceedings of the 22nd VLDB Conference, pp. 307–317, 1996.


Feldman, P., and Miller, D. “Entity Model Clustering: Structuring a Data Model by Abstraction,” Computer Journal 29,4 (Aug. 1986), pp. 348–360.

Fowler, M. Patterns of Enterprise Application Architecture, Addison-Wesley, 2002.

Gennick, J. Oracle SQL*Plus: The Definitive Guide, O’Reilly, 1999.

Gennick, J. SQL Pocket Guide, O’Reilly, 2004.

Gray, P., and Watson, H. J. Decision Support in the Data Warehouse, Prentice-Hall, Upper Saddle River, NJ, 1998.

Halpin, T. Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design, Morgan Kaufmann, 2001.

Hammer, M., and McLeod, D. “Database Description with SDM: A Semantic Database Model,” ACM Trans. Database Systems 6,3 (Sept. 1982), pp. 351–386.

Han, J., and Kamber, M. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, 2001.

Harinarayan, V., Rajaraman, A., and Ullman, J. D. “Implementing Data Cubes Efficiently,” in Proceedings of the 1996 ACM-SIGMOD Conference, pp. 205–216, 1996.

Harriman, A., Hodgetts, P., and Leo, M. “Emergent Database Design: Liberating Database Development with Agile Practices,” Agile Development Conference, Salt Lake City, June 22–26, 2004. http://www.agiledevelopmentconference.com/files/XR2-2.pdf

Hawryszkiewycz, I. Database Analysis and Design, SRA, Chicago, 1984.

Hernandez, M. J., and Getz, K. Database Design for Mere Mortals: A Hands-On Guide for Relational Databases, 2nd ed., Addison-Wesley, 2003.

Hull, R., and King, R. “Semantic Database Modeling: Survey, Applications, and Research Issues,” ACM Computing Surveys 19,3 (Sept. 1987), pp. 201–260.

IBM Rational Software, http://www-306.ibm.com/software/rational/

IDEF1X, http://www.idef.com


Jajodia, S., and Ng, P. “Translation of Entity-Relationship Diagrams into Relational Structures,” J. Systems and Software 4,2 (1984), pp. 123–133.

Jensen, C. S., and Snodgrass, R. T. “Semantics of Time-Varying Information,” Information Systems 21,4 (1996), pp. 311–352.

Kent, W. “A Simple Guide to Five Normal Forms in Relational Database Theory,” Comm. ACM 26,2 (Feb. 1983), pp. 120–125.

Kent, W. “Fact-Based Data Analysis and Design,” J. Systems and Software 4 (1984), pp. 99–121.

Kimball, R., and Caserta, J. The Data Warehouse ETL Toolkit, 2nd ed., Wiley, 2004.

Kimball, R., and Ross, M. The Data Warehouse Lifecycle Toolkit, 2nd ed., Wiley, 1998.

Kimball, R., and Ross, M. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd ed., Wiley, 2002.

Kotidis, Y., and Roussopoulos, N. “DynaMat: A Dynamic View Management System for Data Warehouses,” in Proceedings SIGMOD ’99, pp. 371–382, 1999.

Maier, D. Theory of Relational Databases, Computer Science Press, Rockville, MD, 1983.

Makridakis, S., Wheelwright, C., and Hyndman, R. J. Forecasting Methods and Applications, 3rd ed., John Wiley & Sons, 1998.

Martin, J. Strategic Data-Planning Methodologies, Prentice-Hall, 1982.

Martin, J. Managing the Data-Base Environment, Prentice-Hall, 1983.

McGee, W. “A Contribution to the Study of Data Equivalence,” Data Base Management, J. W. Klimbie and K. L. Koffeman (eds.), North-Holland, Amsterdam, 1974, pp. 123–148.

McLeod, D., and King, R. “Applying a Semantic Database Model,” Proc. 1st Intl. Conf. on the Entity-Relationship Approach to Systems Analysis and Design, North-Holland, Amsterdam, 1979, pp. 193–210.

Melton, J., and Simon, A. R. Understanding The New SQL: A Complete Guide, Morgan Kaufmann Pub., 1993.


Mitchell, T. M. Machine Learning, WCB/McGraw-Hill, Boston, 1997.

Muller, R. Database Design for Smarties: Using UML for Data Modeling, Morgan Kaufmann Pub., 1999.

Mullins, C. S. DB2 Developer’s Guide, 5th ed., Sams Publishing, 2004.

Nadeau, T. P., and Teorey, T. J. “Achieving Scalability in OLAP Materialized View Selection,” in Proceedings of DOLAP ’02, pp. 28–34, 2002.

Nadeau, T. P., and Teorey, T. J. “A Pareto Model for OLAP View Size Estimation,” Information Systems Frontiers, vol. 5, no. 2, pp. 137–147, Kluwer Academic Publishers, 2003.

Naiburg, E. J., and Maksimchuk, R. A. UML for Database Design, Addison-Wesley, 2001.

Nielsen, P. Microsoft SQL Server Bible, Wiley, 2003.

Nijssen, G. M., and Halpin, T. A. Conceptual Schema and Relational Database Design: A Fact Oriented Approach, Prentice-Hall, 1989.

Objects by Design: UML modeling tools, http://www.objectsbydesign.com/tools/umltools_byCompany.html

Peckham, J., and Maryanski, F. “Semantic Data Models,” ACM Computing Surveys 20,3 (Sept. 1988), pp. 153–190.

Quatrani, T. Visual Modeling with Rational Rose 2002 and UML, 3rd ed., Addison-Wesley, 2003.

Ramakrishnan, R., and Gehrke, J. Database Management Systems, 3rd ed., McGraw Hill, 2004, Chapter 20.

Reiner, D., Brown, G., Friedell, M., Kramlich, D., Lehman, J., McKee, R., Rheingans, P., and Rosenthal, A. “A Database Designer’s Workbench,” Proc. 5th ER Conference, North-Holland, Amsterdam, 1986, pp. 347–360.

Rumbaugh, J., Jacobson, I., and Booch, G. The Unified Modeling Language User Guide, 2nd ed., Addison-Wesley, 2004.

Rumbaugh, J., Jacobson, I., and Booch, G. The Unified Modeling Language Reference Manual, 2nd ed., Addison-Wesley, 2005.


Sakai, H. “Entity-Relationship Approach to Logical Database Design,” Entity-Relationship Approach to Software Engineering, C. G. Davis, S. Jajodia, P. A. Ng, and R. T. Yeh (editors), Elsevier, North-Holland, Amsterdam, 1983, pp. 155–187.

Scheuermann, P., Scheffner, G., and Weber, H. “Abstraction Capabilities and Invariant Properties Modelling within the Entity-Relationship Approach,” Entity-Relationship Approach to Systems Analysis and Design, P. Chen (editor), Elsevier, North-Holland, Amsterdam, 1980, pp. 121–140.

Senko, M., et al. “Data Structures and Accessing in Database Systems,” IBM Syst. J. 12,1 (1973), pp. 30–93.

Silberschatz, A., Korth, H. F., and Sudarshan, S. Database System Concepts, 4th ed., McGraw-Hill, 2002.

Simon, A. R. Strategic Database Technology: Management for the Year 2000, Morgan Kaufmann, San Francisco, 1995.

Simsion, G. C., and Witt, G. C. Data Modeling Essentials: Analysis, Design, and Innovation, 2nd ed., Coriolis, 2001.

Smith, H. “Database Design: Composing Fully Normalized Tables from a Rigorous Dependency Diagram,” Comm. ACM 28,8 (1985), pp. 826–838.

Smith, J., and Smith, D. “Database Abstractions: Aggregation and Generalization,” ACM Trans. Database Systems 2,2 (June 1977), pp. 105–133.

Snodgrass, R. T. Developing Time-Oriented Database Applications in SQL, Morgan Kaufmann Publishers, 2000.

Stephens, R., and Plew, R. Database Design, Sams, 2000.

Sybase PowerDesigner, http://www.sybase.com/products/developmentintegration/powerdesigner

Teichroew, D., and Hershey, E. A. “PSL/PSA: A Computer Aided Technique for Structured Documentation and Analysis of Information Processing Systems,” IEEE Trans. Software Engr. SE-3,1 (1977), pp. 41–48.


Teorey, T., and Fry, J. Design of Database Structures, Prentice-Hall, 1982.

Teorey, T. J., Wei, G., Bolton, D. L., and Koenig, J. A. “ER Model Clustering as an Aid for User Communication and Documentation in Database Design,” Comm. ACM 32,8 (Aug. 1989), pp. 975–987.

Teorey, T. J., Yang, D., and Fry, J. P. “A Logical Design Methodology for Relational Databases Using the Extended Entity-Relationship Model,” ACM Computing Surveys 18,2 (June 1986), pp. 197–222.

Thomsen, E. OLAP Solutions, Wiley, 1997.

Tsichritzis, D., and Lochovsky, F. Data Models, Prentice-Hall, 1982.

Ubiquiti Inc. Web site: http://www.ubiquiti.com/

UML Overview, from Developer.com, http://www.developer.com/design/article.php/1553851

Understanding relational databases: referential integrity, http://www.miswebdesign.com/resources/articles/wrox-beginning-php-4-chapter-3-5.html

The University of Waikato, Weka 3—Data Mining with Open Source Machine Learning Software in Java. http://www.cs.waikato.ac.nz/ml/weka.

van der Lans, R. Introduction to SQL: Mastering the Relational Database Language, 3rd ed., Addison-Wesley, 2000.

Widom, J. Data Management for XML. http://www-db.stanford.edu/~widom/xml-whitepaper.html

Wilmot, R. “Foreign Keys Decrease Adaptability of Database Designs,” Comm. ACM 27,12 (Dec. 1984), pp. 1237–1243.

Witten, I., and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, San Francisco, 2000.

Wong, E., and Katz, R. “Logical Design and Schema Conversion for Relational and DBTG Databases,” Proc. Intl. Conf. on the Entity-Relationship Approach, 1979, pp. 311–322.

Yao, S. B. (editor). Principles of Database Design, Prentice-Hall, 1985.


Zachman, J. A. “A Framework for Information Systems Architecture,” IBM Systems Journal 26,3 (1987), IBM Pub. G321-5298.

Zachman Institute for Framework Advancement, http://www.zifa.com.


Exercises

ER and UML Conceptual Data Modeling

Problem 2-1

Draw a detailed ER diagram for a car rental agency database (e.g., Hertz), keeping track of the current rental location of each car, its current condition and history of repairs, customer information for a local office, expected return date, return location, and car status (ready, being-repaired, currently-rented, being-cleaned). Select attributes from your intuition about the situation and list them separately from the diagram, but associated with a particular entity or relationship in the ER model.

Problem 2-2

Given the following assertions for a relational database that represents the current term enrollment at a large university, draw an ER diagram for this schema that takes into account all the assertions given. There are 2,000 instructors, 4,000 courses, and 30,000 students. Use as many ER constructs as you can to represent the true semantics of the problem.


Assertions:

• An instructor may teach one or more courses in a given term (average is 2.0 courses).

• An instructor must direct the research of at least one student (average = 2.5 students).

• A course may have none, one, or two prerequisites (average = 1.5 prerequisites).

• A course may exist even if no students are currently enrolled.

• Each course is taught by exactly one instructor.

• The average enrollment in a course is 30 students.

• A student must select at least one course per term (average = 4.0 course selections).

Problem 3-1

Draw UML class diagrams for a car rental agency database (e.g., Hertz), keeping track of the current rental location of each car, its current condition and history of repairs, customer information for a local office, expected return date, return location, and car status (ready, being-repaired, currently-rented, being-cleaned). Select attributes from your intuition about the situation.

Draw one diagram showing the relationships of the classes without the attributes listed, then show each class individually with the attributes listed.

Problem 3-2

Given the following assertions for a relational database that represents the current term enrollment at a large university, draw a UML diagram for this schema that takes into account all the assertions given. There are 2,000 instructors, 4,000 courses, and 30,000 students. Use as many UML constructs as you can to represent the true semantics of the problem.

Assertions:

• An instructor may teach none, one, or more courses in a given term (average is 2.0 courses).

• An instructor must direct the research of at least one student (average = 2.5 students).

• A course may have none, one, or two prerequisites (average = 1.5 prerequisites).

• A course may exist even if no students are currently enrolled.

• Each course is taught by exactly one instructor.

• The average enrollment in a course is 30 students.

• A student must select at least one course per term (average = 4.0 course selections).

Conceptual Data Modeling and Integration

Problem 4-1

The following ER diagrams represent two views of a video store database as described to a database designer. Show how the two views can be integrated in the simplest and most useful way by making all necessary changes on the two diagrams. State any assumptions you need to make.


Transformation of the Conceptual Model to SQL

Problem 5-1

1. Transform your integrated ER diagram from Problem 4-1 into a SQL database with five to ten rows per table of data you make up to fit the database schema.

2. Demonstrate your database by displaying all the queries below:

a. Which video store branches have Shrek in stock (available)now?

b. In what section of the store (film category) can you find Termi-nator?

c. For customer “Anika Sorenstam,” what titles are currentlybeing rented and what are the overdue charges, if any?

d. Any query of your choice. (Show me what your system canreally do!)
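
For instance, here is a minimal sketch of the kind of schema and query that could answer part a. The table and column names (branch, movie, video_copy, and their columns) are invented for illustration and will likely differ from your own integrated diagram:

CREATE TABLE branch
  (branch_no   INTEGER PRIMARY KEY,
   city        VARCHAR(30));

CREATE TABLE movie
  (movie_id      INTEGER PRIMARY KEY,
   title         VARCHAR(60),
   film_category VARCHAR(20));

CREATE TABLE video_copy
  (copy_no   INTEGER PRIMARY KEY,
   movie_id  INTEGER REFERENCES movie,
   branch_no INTEGER REFERENCES branch,
   status    VARCHAR(12));   -- assumed values: 'available', 'rented', ...

-- Query a: which branches have Shrek available now?
SELECT b.branch_no, b.city
FROM   branch b, video_copy v, movie m
WHERE  v.branch_no = b.branch_no
AND    v.movie_id  = m.movie_id
AND    m.title     = 'Shrek'
AND    v.status    = 'available';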

Normalization and Minimum Set of Tables

Problem 6-1

Given the table R1(A, B, C) with FDs A -> B and B -> C:

1. Is A a superkey for this table? _____________

2. Is B a superkey for this table? _____________

3. Is this table in 3NF, BCNF, or neither? _____________

Problem 6-2

Given the table R(A, B, C, D) with FDs AB -> C and BD -> A:

1. What are all the superkeys of this table? _____________

2. What are all the candidate keys for this table? _____________

3. Is this table in 3NF, BCNF, or neither? _____________

Problem 6-3

1. From the ER diagram given below and the resulting tables and candidate key attributes defined, what are all the functional dependencies (FDs) you can derive just by looking at the diagram and the list of candidate keys?

Table              Candidate Key(s)
customer           cid
order              orderno
department         deptno
salesperson        sid
item               itemno
order-dept-sales   orderno, sid AND orderno, deptno
order-item-sales   orderno, itemno

Table              FDs
customer
order
department
salesperson
item
order-dept-sales
order-item-sales

2. What is the level of normalization for each of these tables, based on the information given?

Problem 6-4

The following functional dependencies (FDs) represent a set of airline reservation system database constraints. Design a minimum set of BCNF tables, preserving all FDs, and express your solution in terms of the code letters given below (a time-saving device for your analysis). Is the set of tables you derived also BCNF?

reservation_no -> agent_no, agent_name, airline_name, flight_no, passenger_name

reservation_no -> aircraft_type, departure_date, arrival_date, departure_time, arrival_time

reservation_no -> departure_city, arrival_city, type_of_payment, seating_class, seat_no

airline_name, flight_no -> aircraft_type, departure_time, arrival_time

airline_name, flight_no -> departure_city, arrival_city, meal_type

airline_name, flight_no, aircraft_type -> meal_type

passenger_name -> home_address, home_phone, company_name

aircraft_type, seat_no -> seating_class

company_name -> company_address, company_phone

company_phone -> company_name

A: reservation_no L: departure_city

B: agent_no M: arrival_city

C: agent_name N: type_of_payment

D: airline_name P: seating_class

E: flight_no Q: seat_no

F: passenger_name R: meal_type

G: aircraft_type S: home_address

H: departure_date T: home_phone

I: arrival_date U: company_name

J: departure_time V: company_address

K: arrival_time W: company_phone

Problem 6-5

Given the following set of FDs, find the minimum set of 3NF tables. Designate the candidate key attributes of these tables. Is the set of tables you derived also BCNF?

1. J -> KLMNP

2. JKL -> MNP

3. K -> MQ

4. KL -> MNP

5. KM -> NP

6. N -> KP

Problem 6-6

Using only the given set of functional dependencies (FDs), find the minimum set of BCNF tables. Show the results of each step of Bernstein’s algorithm separately. What are the candidate keys for each table?

1. ABC -> DEFG

2. B -> CH

3. A -> HK

4. CD -> GKM

5. D -> CP

6. E -> FG

7. G -> CDM

Problem 6-7

Given the FDs listed below, determine the minimum set of 3NF tables, preserving all FDs. Define the candidate keys of each table and determine which tables are also BCNF.

1. ABC -> H

2. AC -> BDEFG

3. ACDF -> BG

4. AW -> BG

5. B -> ACF

6. H -> AXY

7. M -> NZ

8. MN -> HPT

9. XY -> MNP

Problem 6-8

Given the following FDs, determine the minimum set of 3NF tables. Make sure that all FDs are preserved. Specify the candidate keys of each table. Note that each letter represents a separate data element (attribute). Is the set of tables you derived also BCNF?

1. ABCD -> EFGHIJK

2. ACD -> JKLMN

3. A -> BH

4. B -> JKL

5. BH -> PQR

6. BL -> PS

7. EF -> ABCDH

8. JK -> B

9. MN -> ACD

10. L -> JK

11. PQ -> S

12. PS -> JKQ

13. PSR -> QT

Problem 6-9

Given the following FDs, determine the minimum set of 3NF tables. Make sure that all FDs are preserved. Specify the candidate keys of each table. Note that each letter represents a separate data element (attribute). Is the set of tables you derived also BCNF?

1. A -> BGHJ

2. AG -> HK

3. B -> K

4. EA -> F

5. EB -> AF

6. EF -> A

7. H -> J

8. J -> AB

Problem 6-10

FDs and MVDs

Answer each question yes or no, and justify each answer. In most cases, you will be given a table R with a list of attributes, with at most one candidate key (the candidate key may be either a single attribute or a composite attribute key, shown underlined).

Given table R(A, B, C, D) and the functional dependency AB -> C:

1. Is R in 3NF?

2. Is R in BCNF?

3. Does the multivalued dependency AB ->> C hold?

4. Does the set {R1(A, B, C), R2(A, B, D)} satisfy the lossless join property?

Given table R(A, B, C), where the set {R1(A, B), R2(B, C)} satisfies the lossless decomposition property:

1. Does the multivalued dependency B ->> C hold?

2. Is B a candidate key?

3. Is R in 4NF?

Given a table “skills_available” with attributes empno, project, and skill, in which the semantics of “skills_available” state that every skill an employee has must be used on every project that employee works on:

1. Is the level of normalization of “skills_available” at least 4NF?

2. Given table R(A,B,C) with actual data shown below:

a. Does the multivalued dependency B ->> C hold?

b. Is R in 5NF?

R:  A  B  C
    w  x  p
    w  x  q
    z  x  p
    z  x  q
    w  y  q
    z  y  p
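
If you want to check your answer experimentally, the sample rows can be loaded into any SQL database and the lossless-join test for B ->> C run directly; this is a minimal sketch (table and column names are assumed):

CREATE TABLE r (a CHAR(1), b CHAR(1), c CHAR(1));

INSERT INTO r VALUES ('w', 'x', 'p');
INSERT INTO r VALUES ('w', 'x', 'q');
INSERT INTO r VALUES ('z', 'x', 'p');
INSERT INTO r VALUES ('z', 'x', 'q');
INSERT INTO r VALUES ('w', 'y', 'q');
INSERT INTO r VALUES ('z', 'y', 'p');

-- B ->> C holds exactly when joining the projections on (A,B) and (B,C)
-- reproduces r with no extra (spurious) rows; compare this result to r.
SELECT DISTINCT r1.a, r1.b, r2.c
FROM   (SELECT a, b FROM r) r1,
       (SELECT b, c FROM r) r2
WHERE  r1.b = r2.b;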

Logical Database Design (Generic Problem)

Problem 7-1

Design and implement a small database that will be useful to your company or student organization.

1. State the purpose of the database in a few sentences.

2. Construct an ER or UML class diagram for the database.

3. Transform your ER or UML diagram into a working database with five to ten rows per table of data you can make up to fit the database schema. You should have at least four tables, enough to have some interesting queries. Use Oracle, DB2, SQL Server, Access, or any other database system.

4. Show that your database is normalized (BCNF) using FDs derived from your ER diagram and from personal knowledge of the data. Analyze the FDs for each table separately (this simplifies the process).

5. Demonstrate your working database by displaying the results of four queries. Pick interesting and complex queries (impress us!).

OLAP

Problem 8-1

As mentioned in Chapter 8, hypercube lattice structures are a specialization of product graphs. Figure 8.16 shows an example of a three-dimensional hypercube lattice structure. Figure 8.13 shows an example of a two-dimensional product graph. Notice that the two figures are written using different notations. Write the hypercube lattice structure in Figure 8.16 using the product graph notation introduced with Figure 8.13. Keep the same dimension order. Don’t worry about carrying over the view sizes. Assume the Customer, Part, and Supplier dimensions are keyed by “customer id,” “part id,” and “supplier id,” respectively. Shade the nodes representing the fact table and the views selected for materialization as indicated in Section 8.2.4.

Solutions to Selected Exercises

Problem 2-2

Problem 4-1

Connect Movie to Video-copy as a 1-to-N relationship (Video-copy at the N side); or, use a generalization from Movie to Video-copy, with Movie being the supertype and Video-copy as the subtype.

Problem 6-1

Given the table R1(A, B, C) with FDs A -> B and B -> C:

1. Is A a superkey for this table? Yes: A -> B and B -> C give A -> ABC by transitivity, so A determines every attribute.

2. Is B a superkey for this table? No: B determines C but not A.

3. Is this table in 3NF, BCNF, or neither? Neither 3NF nor BCNF: B -> C is a transitive dependency of a nonkey attribute on the key A.
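
As a possible BCNF repair, sketched here as an aside: since B -> C is the offending transitive dependency, R1 decomposes losslessly into R11(A, B) and R12(B, C), both of which are in BCNF. In SQL, with assumed column types:

CREATE TABLE r12
  (b INTEGER PRIMARY KEY,       -- holds B -> C
   c INTEGER);

CREATE TABLE r11
  (a INTEGER PRIMARY KEY,       -- holds A -> B
   b INTEGER REFERENCES r12);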

Problem 6-3

Table              FDs                                Level of Normalization
customer           cid -> cname, caddress             BCNF
order              orderno -> cid                     BCNF
department         NONE                               BCNF
salesperson        sid -> deptno                      BCNF
item               itemno -> deptno, itemname, size   BCNF
order-dept-sales   orderno, sid -> deptno             BCNF
                   orderno, deptno -> sid             BCNF
order-item-sales   orderno, itemno -> sid             BCNF

Problem 6-5

Given these FDs, begin Step 1 (LHS reduction):

1. J -> KLMNP

2. JKL -> MNP   First, eliminate K and L since J -> KL in (1); merge with (1)

3. K -> MQ

4. KL -> MNP   Third, eliminate L; since K -> MNP from the merged (3), (4) is redundant

5. KM -> NP Second, eliminate M since K -> M in (3); merge with (3)

6. N -> KP

End of Step 1, begin Step 2 (RHS reduction for transitivities):

1. J -> KLMNP First, reduce by eliminating MNP since K -> MNP

2. K -> MQNP Second, reduce by eliminating P since N -> P

3. N -> KP

End of Step 2 and consolidation in Step 3:

1. J -> KL

2. K -> MNQ (or K -> MNPQ) First, merge (2) and (3) for superkey rules 1 and 2

3. N -> KP (or N -> K)

Steps 4 and 5:

1. J -> KL Candidate key is J (BCNF)

2. K -> MNPQ and N -> K Candidate keys are K and N (BCNF)
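
Rendered as SQL, one plausible encoding of these two tables (column names, types, and table names are invented for illustration):

CREATE TABLE table_k
  (k INTEGER PRIMARY KEY,        -- candidate key K
   m INTEGER,
   n INTEGER NOT NULL UNIQUE,    -- N -> K makes N a second candidate key
   p INTEGER,
   q INTEGER);

CREATE TABLE table_j
  (j INTEGER PRIMARY KEY,        -- candidate key J
   k INTEGER REFERENCES table_k, -- J -> KL
   l INTEGER);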

Problem 6-7

Given these FDs:

1. ABC -> H   Step 1: B is extraneous due to AC -> B in (3)

2. ACDF -> BG   Step 1: D and F are extraneous due to AC -> DF in (3)

3. AC -> BDEFG   Step 2: eliminate F from RHS due to B -> F from (5)

4. AW -> BG   Step 2: eliminate G from RHS due to B -> AC -> G from (3) and (5)

5. B -> ACF

6. H -> AXY

7. M -> NZ

8. MN -> HPT Step 1: N is extraneous due to M -> N in (7)

9. XY -> MNP Step 2: eliminate NP from the RHS due to M -> NP from (7,8)

After Step 3, using the union axiom:

1. AC -> BDEGH Combining (1), (2), and (3)

2. AW -> B

3. B -> ACF

4. H -> AXY

5. M -> HNPTZ Combining (7) and (8)

6. XY -> M

Step 4, merging:

1. AC -> BDEFGH, B -> AC, H -> A using Rule 1 for AC being a superkey, Rule 2 for B being a superkey, the definition of 3NF, and A being a prime attribute. 3NF only.

2. AW -> BG using Rule 1 for AW to be a superkey. BCNF.

3. H -> XY, XY -> M, and M -> HNPTZ using Rule 1 for H being a superkey (after taking H to its closure H -> XYMNPTZ), using Rule 2 for M being a superkey from M -> H, and Rule 2 for XY being a superkey from XY -> M. BCNF. Note: H -> AXY is also possible here.

Step 5, minimum set of normalized tables:

Table 1: ABCDEFGH with superkeys AC, B (3NF only)

Table 2: ABGW with superkey AW (3NF and BCNF)

Table 3: HMNPTXYZ with superkeys H, XY, M (3NF and BCNF)
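
Table 3's three candidate keys can be captured directly in SQL, one as the primary key and the others as UNIQUE constraints; this sketch assumes integer columns and invented names:

CREATE TABLE table3
  (h INTEGER PRIMARY KEY,       -- candidate key H
   m INTEGER NOT NULL UNIQUE,   -- candidate key M
   x INTEGER NOT NULL,
   y INTEGER NOT NULL,
   n INTEGER,
   p INTEGER,
   t INTEGER,
   z INTEGER,
   UNIQUE (x, y));              -- composite candidate key XY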

About the Authors

Toby Teorey is a professor in the Electrical Engineering and Computer Science Department at the University of Michigan, Ann Arbor. He received his B.S. and M.S. degrees in electrical engineering from the University of Arizona, Tucson, and a Ph.D. in computer science from the University of Wisconsin, Madison. He was chair of the 1981 ACM SIGMOD Conference and program chair of the 1991 Entity-Relationship Conference. Professor Teorey’s current research focuses on database design and performance of computing systems. He is a member of the ACM.

Sam Lightstone is a Senior Technical Staff Member and Development Manager with IBM’s DB2 Universal Database development team. He is the cofounder and leader of DB2’s autonomic computing R&D effort. He is also a member of IBM’s Autonomic Computing Architecture Board, and in 2003 he was elected to the Canadian Technical Excellence Council, the Canadian affiliate of the IBM Academy of Technology. His current research includes numerous topics in autonomic computing and relational DBMSs, including automatic physical database design, adaptive self-tuning resources, automatic administration, benchmarking methodologies, and system control. Mr. Lightstone is an IBM Master Inventor with over 25 patents and patents pending, and he has published widely on autonomic computing for relational database systems. He has been with IBM since 1991.

Tom Nadeau is a senior technical staff member of Ubiquiti Inc. and works in the area of data and text mining. He received his B.S. degree in computer science and M.S. and Ph.D. degrees in electrical engineering and computer science from the University of Michigan, Ann Arbor. His technical interests include data warehousing, OLAP, data mining, and machine learning. He won the best paper award at the 2001 IBM CASCON Conference.

Index

Activity diagrams, 46–50
    control flow icons, 46, 47
    database design and, 50
    decisions, 46, 47
    defined, 34, 46
    example, 49
    flows, 46, 47
    forks, 47–48
    joins, 47, 48
    nodes, 46, 47
    notation description, 46–48
    for workflow, 48–50
    See also UML diagrams
Aggregate functions, 222–24
Aggregation, 25
    composition vs., 41
    defined, 25
    ER model, 101
    illustrated, 25
    UML constructs, 41
    UML model, 102
AllFusion ERwin Data Modeler, 188–89
    advantages, 211
    DBMS selection, 199
    ER modeling, 194
    modeling support, 209
    one-to-many relationships, 195
    schema generation, 198
    See also CASE tools
Armstrong axioms, 122–24
Association rules, 179
Associations, 37–39
    binary, 38–39
    many-to-many, 39
    many-to-many-to-many, 100
    one-to-many, 39
    one-to-many-to-many, 99
    one-to-one, 39
    one-to-one-to-many, 98
    one-to-one-to-one, 97
    reflexive, 37
    ternary, 39
    See also Relationships
Attributes, 15–16
    assignment, 19
    attachment, 57
    classifying, 57

    data types, 214
    defined, 15
    descriptor, 15
    extraneous, elimination, 125
    identifier, 15, 16
    multivalued, 15, 57
    of relationships, 17, 19
    ternary relationships, 28
    UML notation, 35, 36
Automatic summary tables (AST), 166
Binary associations, 38–39
Binary recursive relationships, 90–92
    ER model, 90
    many-to-many, 90, 91, 92
    one-to-many, 90, 91
    one-to-one, 90–91
    UML model, 91
    See also Relationships
Binary relationships, 85–89
    many-to-many, 85, 87, 89, 104
    one-to-many, 85, 87, 89
    one-to-one, 85, 86, 88
    See also Relationships
Binomial multifractal distribution tree, 172
Boyce-Codd normal form (BCNF), 107, 115–16, 118
    defined, 115
    strength, 115
    tables, 132, 133, 144
    See also Normal forms
Business intelligence, 147–86
    data mining, 178–85
    data warehousing, 148–66
    defined, 147
    OLAP, 166–78
    summary, 185
Business system life cycle, 188
Candidate keys, 109, 110
Candidate tables
    from ER diagram transformation, 121–22
    normalization of, 118–22
    primary FDs, 119
    secondary FDs, 119
    See also Tables
Cardenas’ formula, 170, 171
CASE tools, 1, 187–211
    AllFusion ERwin Data Modeler, 188–89, 194, 195, 199, 211
    application life cycle tooling integration, 202–4
    basics, 192–96
    collaborative support, 200–201
    database generation, 196–99
    database support, 199–200
    data warehouse modeling, 207–9
    defined, 187
    design compliance checking, 204–5
    development cycle, 190
    DeZign for Databases, 190
    distribution development, 201–2
    ER/Studio, 190
    introduction, 188–91
    key capabilities, 191–92
    low-end, 192
    PowerDesigner, 188, 200, 203, 206, 210, 211
    QDesigner, 190
    Rational Data Architect, 188, 189, 193, 195, 198, 210
    reporting, 206–7
    script generation, 196
    summary, 211
    transformation table types, 192–93
    value, 190–91
    Visible Analyst, 190
    XML and, 209–10
Chen notation, 9, 10
Class diagrams, 34–46
    constructs illustration, 36
    for database design, 37–43

    defined, 33–34
    example, 43–46
    notation, 35–37
    packages, 43–46
    See also UML diagrams
Classes
    associations, 37
    defined, 34
    entities vs., 34–35
    notation, 35
    organization, 43
Clustering, 74–81
    concepts, 75–76
    defined, 74
    grouping operations, 76–78
    illustrated, 76
    results, 81
    techniques, 78–81
    See also Entity-relationship (ER) model
Collaboration
    capabilities, 200–201
    concurrency control, 200
    support, 200–201
Columns, 2
Comparison operators, 214
Compliance checkers, 204–5
Composition, 36–37
    aggregation vs., 41
    defined, 37
Computer-aided software engineering tools. See CASE tools
Conceptual data modeling, 3–4, 8–10, 55–66
    alternative notations, 20–22
    analysis results, 142
    diagram for UML, 142
    emphasis, 9
    notations, 21–22
    substeps, 55–56
    transformation to SQL, 6, 83–106
    See also Data modeling
Concurrency control, 200
Constraints
    foreign, 217
    not null, 216
    primary, 217
    unique, 216–17
Data
    independence, 2
    items, 2
    semi-structured, 209–10
    summarizing, 165–66
    XML, 209–10
Database design
    class diagrams for, 37–43
    compliance checking, 204–5
    knowledge of, 11
    logical, 3–6, 53–54, 139–45
    physical, 1, 6–8
Database life cycle, 3–8
    illustrated, 4
    implementation, 8
    logical design, 3–6
    physical design, 6–8
    requirements analysis, 3
    step-by-step results, 5, 7
Database management system (DBMS)
    DDL, 8
    defined, 2
    relational, 2
Databases
    CASE tools support, 199–200
    defined, 2
    generating from designs, 196–99
Data definition language (DDL), 8, 190, 196, 213, 215–18
    constraints, 216–17
    referential trigger actions, 217–18
Data manipulation language (DML), 8, 214, 218–29
    aggregate functions, 222–24
    basic commands, 220–21
    intersection command, 221–22

    referential integrity, 227–28
    select command, 219
    union command, 221–22
    update commands, 226–27
    usage, 218
    views, 228–29
Data mining, 178–85
    algorithms, 147, 178
    forecasting, 179–81
    text mining, 181–85
Data modeling
    conceptual, 3–4, 8–10, 55–66
    example, 61–66
    example illustration, 62–63, 65
    knowledge of, 11
Data warehouses, 148–66
    architecture, 148
    architecture illustration, 150
    bus, 164
    CASE tool modeling, 207–9
    core requirements, 149–51
    defined, 148
    dimensional data modeling, 152–54
    dimensional design process, 156
    dimensional modeling example, 156–65
    fact tables, 154, 155
    integration capabilities, 149–50
    life cycle, 151–52
    life cycle illustration, 153
    logical design, 152–66
    OLTP database comparison, 149
    overview, 148–52
    purpose, 155
    snowflake schema, 156, 157
    star schema, 154–56
    subject areas, 149
DB2 Cube Views, 207
Decisions, 46, 47
Decision support systems (DSSs), 148
Denormalization, 8
Descriptors, 15
Design patterns, 204
DeZign for Databases, 190
Dimensional data modeling, 152–54
    example, 156–65
    process, 156
    See also Data modeling
Dimensions
    degenerate, 154
    determining, 159
Dimension tables, 154, 160
Double exponential smoothing, 179, 181
Drill-down, 155
Entities, 13–14
    classes vs., 34–35
    classifying, 56–57
    clustering, 74–81
    contents, 56–57
    defined, 13
    dominant, 78
    existence, 19–20
    intersection, 20
    in ternary relationships, 25–26
    transformation, 104
    weak, 16, 103
Entity clusters
    defined, 75
    forming, 80
    higher-level, 80
    root, 75
    See also Clustering
Entity instances, 13
Entity keys, 54
Entity-relationship diagram (ERD) notation, 194
Entity-relationship (ER) model, 9, 13–31
    advanced constructs, 23–30
    aggregation, 25
    AllFusion Data Modeler, 194

    application to relational database design, 64–66
    with Chen notation, 9, 10
    conceptual data model diagram, 141
    constructs, 13–22
    definition levels, 9
    entities, 13–14
    entity clustering, 74–81
    generalization and aggregation, 101
    global schema, 64–66
    illustrated, 14
    many-to-many binary relationship, 87
    one-to-many binary relationship, 87
    one-to-one binary relationship, 86
    Rational Data Architect, 193
    relationships, 14–15, 16–20, 25–29
    simple form example, 9–10
    subtypes, 23–24
    summary, 30
    supertypes, 23–24
    of views based on requirements, 61–63
ER/Studio, 190
Exclusion constraint, 29
Exclusive OR, 29
Executive information systems (EIS), 148
Existence
    defined, 19
    mandatory, 20
    optional, 20, 64
Exponential smoothing, 179–80
    defined, 179
    double, 179, 181
    triple, 180, 182
    See also Forecasting
EXtensible Markup Language. See XML
Fact tables
    defined, 154
    granularity, 155
    See also Data warehouses
Fifth (5NF) normal form, 127, 133–37
    defined, 133
    membership algorithm satisfaction, 134
    See also Normal forms
Files, 2
First normal form (1NF), 109
Flows, 46, 47
    defined, 46
    guard condition, 46
    illustrated, 47
Forecasting, 179–81
    defined, 179
    double exponential smoothing, 179, 181
    exponential smoothing, 179–80
    least squares line approach, 179
    reliability, 180
    triple exponential smoothing, 180, 182
    See also Data mining
Foreign key constraints, 217
Forks, 46, 47
    defined, 47–48
    example, 48–49
    illustrated, 47
Fourth (4NF) normal form, 127, 129–33
    goal, 129
    tables, 131
    tables, decomposing, 132–33
    See also Normal forms
Functional dependence, 111
Functional dependencies (FDs)
    defined, 6
    in n-ary relationships, 29
    primary, 118, 119
    secondary, 118, 119

Generalization, 23
    completeness constraint, 24
    defined, 37
    disjointness constraint, 24
    ER model, 101
    hierarchies, 24, 57–58
    UML constructs, 40
    UML model, 102
Global schema, 3
Grouping operations, 76–78
    application, 77–78
    defined, 76
    illustrated, 77
    types, 77
Guard condition, 46–47
HRU, 173–75
Hypercube lattice structure, 173
IDEF1X notation, 21
Identifiers
    defined, 15
    internal, 16
Inclusive OR, 29
Industry Warehouse Studio (IWS), 207, 208
Information Engineering Workbench (IEW), 20
Informix Data Server (IDS), 211
Intersection command, 221–22
Intersection entities, 20
Join dependencies (JDs), 133, 136
Joins, 46, 47
    defined, 48
    example, 49–50
    illustrated, 47
    SQL, 224–26
Keys
    candidate, 109, 110
    entity, 54
    equivalent, merge, 126
    primary, 42, 109
    superkeys, 109, 123–24
Logical design, 3–6, 53–54
    CASE tools for, 187–211
    conceptual data model diagram, 141, 142
    conversion, 197–98
    data warehouses, 152–66
    example, 139–45
    problems, 140
    requirements specification, 139–40
Logical model, 6
Logical operators, 214
Mandatory existence, 20
Many-to-many relationships, 14
    binary, 85, 87, 89
    binary recursive, 90, 91, 92
    defined, 18
    transformation, 104
    See also Relationships
Many-to-many-to-many ternary associations, 100
Many-to-many-to-many ternary relationships, 96, 137
Materialized views, 167
    metadata, 177
    selection of, 173–76
Measures, 159
Multidimensional databases (MDDs), 151
Multiplicities
    illustrated, 38
    many-to-many, 39
    one-to-many, 39
    one-to-one, 39
Multivalued attributes, 15, 57
Multivalued dependencies (MVDs), 127–29
    defined, 127–28
    inference rules, 129
    nontrivial, elimination, 129

N-ary relationships, 28–29, 92–101
    ER model, 93–96
    UML, 42, 97–100
    variations, 92
Nodes, 46, 47
Nonredundant cover
    partitioning of, 125–26
    search for, 125
Normal forms
    BCNF, 107, 115–16, 118
    defined, 108
    fifth (5NF), 127, 133–37
    first (1NF), 109
    fourth (4NF), 127, 129–33
    second (2NF), 111–13
    third (3NF), 113–15, 118
Normalization, 6, 107–38
    accomplishment, 108–9
    candidate tables derived from ER diagrams, 118–22
    defined, 108
    denormalization, 8
    fundamentals, 107–16
Normalized tables
    design, 116–18
    minimum set, 127
Not null constraints, 216
Objects
    constraint-related, 78
    defined, 34
    dominant, 78
One-to-many relationships, 14
    binary, 85, 87, 89
    binary recursive, 90, 91
    defined, 18
    See also Relationships
One-to-many-to-many ternary associations, 99
One-to-many-to-many ternary relationships, 95
One-to-one relationships, 14
    binary, 85, 86, 88
    binary recursive, 90–91
    defined, 18
    See also Relationships
One-to-one-to-many ternary associations, 98
One-to-one-to-many ternary relationships, 94
One-to-one-to-one ternary associations, 97
One-to-one-to-one ternary relationships, 93
Online Analytical Processing (OLAP), 147, 166–78
    applications, 8
    defined, 166
    materialized view selection, 173–76
    optimization, 169, 170
    overview, 169–70
    query optimization, 177–78
    view maintenance, 176–77
    views, 166, 170–72, 178
Operations
    defined, 35
    notation, 35, 36
Optional existence, 20, 64
Packages, 43–46
    contents, expanding, 45
    defined, 43
    relationships, 44
Physical design, 6–8
Physical model, 6
Polynomial Greedy Algorithm (PGA), 174–75
PowerBuilder, 203
PowerDesigner, 188, 211
    merge process, 200
    plug-in to Sybase PowerBuilder, 203
    reporting features, 206
    XML adoption, 210
Preintegration analysis, 67–68
Primary FDs, 118, 119
    candidate table, 119

    from ER diagram, 120
    See also Functional dependencies (FDs)
Primary keys, 109
    constraints, 217
    UML constructs, 42
QDesigner, 190
Query optimization, 177–78
Rational Data Architect, 188, 189, 211
    automatic computing linkages, 198
    ER modeling, 193
    property editing, 195
    XML adoption, 210
    See also CASE tools
Rational MultiSite software, 201, 202
Redundant relationships, 58–60
    analyzing, 58
    illustrated, 59
    See also Relationships
Referential integrity, 30, 227–28
Reflexive associations, 37
Relational databases (RDBs), 150, 151
Relationships, 14–15
    attributes, 17, 19
    binary, 85–89
    binary recursive, 90–92
    cardinality, 18
    connectivity, 17, 18–19
    defined, 14
    defining, 58–61
    degree, 16–17
    entity existence in, 19–20
    many-to-many, 14, 18, 39
    many-to-many-to-many, 96, 137
    multiple, 103
    names, 15
    n-ary, 28–29, 92–100
    one-to-many, 14, 18, 39
    one-to-many-to-many, 95
    one-to-one, 14, 18, 39
    one-to-one-to-many, 94
    one-to-one-to-one, 93
    packages, 44
    redundant, 58–60
    roles, 15
    ternary, 16, 25–28, 60–61, 92–100
Reporting, 206–7
    elements, 206–7
    PowerDesigner, 206
    See also CASE tools
Requirements analysis, 3, 54–55
    defined, 54
    objectives, 55
    results, 140
Requirements specification, 139–40
Reverse engineering, 6
Roles, 15
Rows, 2
Schemas
    commonality, 72
    comparison, 68
    conceptual integration, 67
    conformation, 68–69
    diversity, 66
    merging, 69
    restructuring, 69
    structural conflicts, 68, 71
Secondary FDs, 118, 119
    candidate table, 119
    determining, 120
    from requirements specification, 121
    See also Functional dependencies (FDs)
Second normal form (2NF), 111–13
    functional dependence, 111
    tables, 112, 113
    See also Normal forms
Select command, 219
Semi-structured data, 209–10
Set operators, 214
Snowflake schema, 156, 157

Software Development Platform, 204
Specialization, 24
SQL, 213–29
    advanced value expressions, 214–15
    aggregate functions, 222–24
    basics, 213–29
    comparison operators, 214
    conceptual data model transformation to SQL, 6, 83–106
    constructs, 83–85
    data types, 214
    DDL, 215–18
    defined, 213
    DML, 218–29
    joins, 224–26
    logical operators, 214
    names, 214
    null values, 84–85
    object definitions, 215
    referential integrity, 227–28
    set functions, 214
    set operators, 214
    subqueries, 224–26
    update commands, 226–27
SQL tables, 83, 84
    with embedded foreign key, 84
    from relationships, 84
    with same information content, 84
Star schema, 154–56
    defined, 154
    dimensions, 162
    for estimating process, 160
    example, 154
    for job costing daily snapshot, 166
    for job costing process, 165
    for productivity tracking process, 163
    queries, 154–55
    for scheduling process, 162
    See also Data warehouses
Stereotypes, 43
Subqueries, 224–26
Subtypes, 23–24
    defined, 23
    entities, 24
    illustrated, 23
Superkeys
    defined, 109
    rules, 123–24
Supertypes, 23–24
    defined, 23
    illustrated, 23
Tables
    Boyce-Codd normal form (BCNF), 132, 133, 144
    candidate, 118–22
    decomposition of, 145
    fourth (4NF) normal form, 131, 132–33
    merge of, 126
    normalized, 6, 116–18
    reduction of, 145
    second normal form (2NF), 112, 113
    third (3NF) normal form, 114, 122–27
Ternary associations, 39
Ternary relationships, 16, 25–28, 92–100
    attributes, 28
    connectivity, 61
    defining, 60
    entities in, 25–26
    ER model, 93–96
    foreign key constraints and, 92
    forms, 28
    illustrated, 26–27, 60
    many-to-many-to-many, 96, 100
    with multiple interpretations, 130
    one-to-many-to-many, 95, 99
    one-to-one-to-many, 94, 98
    one-to-one-to-one, 93, 97
    requirement, 25

    transformation, 105
    UML, 97–100
    varieties, 92
    See also Relationships
Text mining, 181–85
    verbatim description, 184
    verbatim description information, 184
    word mapping, 184–85
    See also Data mining
Third (3NF) normal form, 113–15, 118
    defined, 114
    synthesis algorithm, 124–25
    tables, 114
    tables, minimum set, 122–27
    See also Normal forms
Transformation, 6, 83–106
    entity, 104
    ER-to-SQL example, 105
    many-to-many binary relationship, 104
    rules, 83–85
    steps, 103–5
    summary, 106
    ternary relationship, 105
Triple exponential smoothing, 180, 182
UML diagrams
    activity, 34, 46–50
    class, 33–46
    conceptual data model, 142
    ER models vs., 33
    generalization and aggregation, 102
    many-to-many binary relationship, 89
    one-to-many binary relationship, 89
    one-to-one binary relationship, 88
    organization, 51
    size, 50
    textual descriptions, 50–51
    type, 33
Unified Modeling Language (UML), 9, 33–51
    aggregation constructs, 41
    defined, 33
    generalization constructs, 40
    n-ary relationship, 42
    primary key constructs, 42
    relationship types, 38
    stereotypes, 43
    summary, 51
    usage rules, 50–51
    See also UML diagrams
Union command, 221–22
Unique constraints, 216–17
Update anomaly, 112
Update commands, 226–27
View integration, 5–6, 66–74
    defined, 66
    example, 69–74
    illustrated, 70, 71, 72–73
    merged schema, 72–73
    preintegration analysis, 67–68
    process, 74
    schema comparison, 68
    schema conformation, 68–69
    schema merge/restructure, 69
    techniques, 69
    type conflict, 71
Views, 166
    coordinates of, 178
    creating, 229
    dynamic selection, 176
    ER modeling based on requirements, 61–63
    exponential explosion, 167–69
    size estimation, 170–72
    SQL, 228–29
    state estimation, 170–72
    uses, 228
Visible Analyst, 190

Weak entities, 16, 103

XML, 207, 209–10
    defined, 209
    documents, 210
    schema, 209
    standards, 209
