rdb design 5

Upload: fieast

Post on 09-Oct-2015

Relational Database Design

Bill Woolfolk
Public Health Sciences
University of [email protected]

definition of modern relational database
Understand and be able to apply a practical method for designing databases
Recognize and avoid common pitfalls of database design

What's a database?
A collection of logically-related information stored in a consistent fashion
  Phone book
  Bank records (checking statements, etc.)
  Library card catalog
  Soccer team roster
The storage format typically appears to users as some kind of tabular list (table, spreadsheet)

What Does a Database Do?
Stores information in a highly organized manner
Manipulates information in various ways, some of which are not available in other applications or are easier to accomplish with a database
Models some real world process or activity through electronic means
  Often called modeling a business process
  Often replicates the process only in appearance or end result

Databases and the Systems Which Manage Them
Modern electronic databases are created and managed by means of an RDBMS: Relational DataBase Management System
An individual data storage structure created with an RDBMS is typically called a database
A database and its attendant views, reports, and procedures is called an application

Relational Database Management Systems
Low-end, proprietary, specific purpose
  Email: Outlook, Eudora, Mulberry
  Bibliographic: Ref. Mgr., EndNote, ProCite
Mid-level
  Microsoft Access, Lotus Approach, Borland's Paradox
  More or less total control of design allows custom builds
High-end
  Oracle, Microsoft SQL Server, Sybase, IBM DB2
  Professional level DBs: Banks, e-commerce, secure
  Amazon.com, Ebay.com, Yahoo.com

Problems with Bad Design
Early computers were slow and had limited storage capacity
Redundant or repeating data slowed operations and took up too much precious storage space
Poor design increased chance of data errors, lost or orphaned information

Early computers didn't have hard drives; data was fed from physical storage (cards, tapes) and immediately processed. Tape storage meant swapping tapes to complete a data run.

Benefits of Good Design
Computers today are faster and possess much larger storage devices
Rigid structure of modern relational databases helped codify problems and solutions
Design problems are still possible, because the DBMS software won't protect you from poor practices
Good design still increases efficiency of data processes, reduces waste of storage, and helps eliminate data entry errors

It's a good thing computers are faster: RDBMSs require greater hardware overhead because they are keeping track of more things.

Codd's Rules
Edgar F. Codd
  Mathematician and Researcher at IBM
  Devised the relational data model in 1970
  Published 12 rules in 1985 defining ideal relational database, added 6 more in 1990

E. F. Codd (1970). "A Relational Model of Data for Large Shared Data Banks." CACM 13(6): 377-387. (http://www.acm.org/classics/nov95/toc.html)
E. F. Codd (1985). "Is Your DBMS Really Relational?" and "Does Your DBMS Run By the Rules?" ComputerWorld, October 14 and October 21.

[HANDOUT: CODD'S RULES]

Codd's Rules caused somewhat of a revolution or revelation among database software vendors and their in-house programmers, most of whom began rewriting their DBMS software to comply with most, if not all, of Codd's Rules.

Some even advertised how many of Codd's Rules they met.

These rules are still the basis for relational database systems today.

Modification Anomalies
A search for General Tool Co. would miss General Tool and General Toll. A case-sensitive search for Totally Toys would miss TOTALLY TOYS.

Customers_Orders_Inventory
Customer          OrderNum  ItemNum  Item
General Tool      07456     2246     Pentium Computer
General Toll      08622     3145     HP Printer
General Tool Co.  08622     3967     17" monitor
Totally Toys      06755     2246     Pentium computer
TOTALLY TOYS      08134     3145     Hewlett-Packard Printer
XYZ Inc.          09010     0446     Dot Matrix Printer

Insertion Anomalies
How would you enter a new item into your inventory if no one had ordered it yet?

Deletion Anomalies
If you wanted to stop selling dot matrix printers and remove them from your inventory, you would have to delete the order and customer info for XYZ Inc.

The Fix

Order_Items
OrderNum  ItemNum
06755     2246
07456     2246
08134     3145
08622     3145
08622     3967
09010     0446

Orders
CustomerNum  OrderNum
7822         09010
8755         06755
8755         08134
9123         07456
9123         08622

Customers
CustomerNum  Customer
7822         XYZ Inc.
8755         Totally Toys
9123         General Tool Co.

Products
ItemNum  Item
0446     Dot Matrix Printer
2246     Pentium Computer
3145     Hewlett-Packard printer
3967     17" monitor

So, how do you make sure you don't make these mistakes? Common sense and trial and error will take you part of the way, but it's better to follow a logical design process.

The Design Process
1. Identify the purpose of the database
2. Review existing data
3. Make a preliminary list of fields
4. Make a preliminary list of tables and enter fields
5. Identify the key fields
6. Draft the table relationships
7. Enter sample data and normalize the data/tables
8. Review and finalize the design

[HANDOUT: EXERCISE 1]

Database Modeling
Refers to various, more-or-less formal methods for designing a database
Some provide precision steps and tools
Ex.: Entity-Relationship (E-R) Modeling
  Widely used, especially by high-end database designers who can't afford to miss things
  Fairly complex process
  Extremely precise

1. Identify purpose of the DB
Clients can tell you what information they want but have no idea what data they need.

"We need to keep track of inventory"
"We need an order entry system"
"I need monthly sales reports"
"We need to provide our product catalog on the Web"

Be sure to limit the scope of the database.

Quite often, the stated intention implies data needs far beyond the client's knowledge. Be sure to offer or question extension of the design to other areas.

Example: Tracking inventory implies adjusting inventory in stock every time there is a sale, thus implying that some method of tracking sales is also needed.

The client may say, "We have a database already for that," which implies that you, the designer, may need to tap into the existing DB in some manner.

Or the client may say, "We don't have the budget for that this year; just do the inventory tracking part and we'll keep track of sales manually," thus limiting the scope of your design.

2. Review Existing Data
Electronic
  Legacy database(s)
  Spreadsheets
  Web forms
Manual
  Paper forms
  Receipts and other printed output

3. Make Preliminary Field List
Make sure fields exist to support needs
  Ex. If client wants monthly sales reports, you need a date field for orders.
  Ex. To group employees by division, you need a division identifier.
Make sure values are atomic
  Ex. First and Last names stored separately
  Ex. Addresses broken down to Street, City, State, etc.
Do not store values that can be calculated from other values
  Ex. Age can be calculated from Date of Birth

4. Make Preliminary Tables (and insert the fields into them)
Each table holds info about one subject
Don't worry about the quantity of tables
Look for logical groupings of information
Use a consistent naming convention
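As a small illustration of the "do not store calculated values" rule from step 3, the sketch below keeps only a date of birth and derives the age when it is needed. The function name `age_on` and the sample birth date are assumptions made up for this example; they are not part of the original slides.

```python
from datetime import date


def age_on(date_of_birth: date, today: date) -> int:
    """Return age in whole years on `today`, derived from the stored birth date."""
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)


# Hypothetical record: atomic name fields, and a birth date instead of an age field.
employee = {"FirstName": "Amy", "LastName": "Guya", "DateOfBirth": date(1980, 10, 9)}

age_before_birthday = age_on(employee["DateOfBirth"], date(2015, 10, 8))
age_on_birthday = age_on(employee["DateOfBirth"], date(2015, 10, 9))
print(age_before_birthday, age_on_birthday)
```

A stored Age field would be wrong the day after it was entered; the derived value cannot go stale.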

Naming Conventions
Rules of thumb
  Table names must be unique in DB; should be plural
  Field names must be unique in the table(s)
  Clearly identify table subject or field data
  Be as brief as possible
  Avoid abbreviations and acronyms
  Use fewer than 30 characters
  Use letters, numbers, underscores (_)
  Do not use spaces or other special characters

Uniqueness of field names applies to the table they are in; fields in different tables can have the same name, and linked fields usually should, so they are easily identified.

Naming Conventions (cont'd)
Leszynski Naming Convention (LNC)
  Example: tblEmployees, qryPartNum
  tbl, qry = tag
  Employees, PartNum = basename

LNC at Microsoft Developers Network
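The rules of thumb above are mechanical enough to check automatically. The following is a minimal sketch, assuming names must start with a letter (an extra assumption, not stated on the slide); `name_problems` and `split_lnc` are hypothetical helper names invented for this example.

```python
import re

# Letters, digits and underscores only, starting with a letter, fewer than 30 characters.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]{0,28}$")


def name_problems(names):
    """Return (name, reason) pairs for names that break the rules of thumb."""
    problems = []
    seen = set()
    for name in names:
        if not NAME_PATTERN.match(name):
            problems.append((name, "invalid characters or length"))
        if name.lower() in seen:
            problems.append((name, "duplicate name"))
        seen.add(name.lower())
    return problems


def split_lnc(name):
    """Split an LNC-style name such as tblEmployees into (tag, basename)."""
    match = re.match(r"^([a-z]+)([A-Z].*)$", name)
    return (match.group(1), match.group(2)) if match else ("", name)


problems = name_problems(["tblEmployees", "Order Items", "tblEmployees"])
print(problems)
print(split_lnc("tblEmployees"), split_lnc("qryPartNum"))
```

The check treats names case-insensitively when looking for duplicates, which is the safer choice when the target RDBMS ignores case in identifiers.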

[HANDOUT: LESZYNSKI NAMING CONVENTION]

5. Identify the Key Fields
Primary Key(s)
  Can never be Null; must hold unique values
  Automatically indexed in most RDBMSs
  Values rarely (if ever) change
  Try to include as few fields as possible
Multi-field Primary Key
  Combination of two or more fields that uniquely identify an individual record
Candidate Key
  Field or fields that qualify as a primary key
  Important in Third and Boyce-Codd Normal Forms
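To see the primary key rules enforced, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. The choice of SQLite and the helper `try_insert` are assumptions for illustration; the table and field names follow the employee samples used later in the slides. NOT NULL is written out explicitly because SQLite does not imply it for every primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Single-field primary key: unique and never Null.
conn.execute("CREATE TABLE Employees (EmployeeID TEXT NOT NULL PRIMARY KEY, LastName TEXT)")
# Multi-field primary key: the combination of the two fields must be unique.
conn.execute(
    "CREATE TABLE Employees_Projects ("
    " EmployeeID TEXT NOT NULL, ProjNum TEXT NOT NULL,"
    " PRIMARY KEY (EmployeeID, ProjNum))"
)


def try_insert(sql, params):
    """Run one INSERT and report whether the database accepted it."""
    try:
        conn.execute(sql, params)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


results = [
    try_insert("INSERT INTO Employees VALUES (?, ?)", ("EN1-26", "O'Brien")),
    try_insert("INSERT INTO Employees VALUES (?, ?)", ("EN1-26", "Guya")),    # duplicate key
    try_insert("INSERT INTO Employees VALUES (?, ?)", (None, "Baranco")),     # Null key
    try_insert("INSERT INTO Employees_Projects VALUES (?, ?)", ("EN1-26", "30-452-T3")),
    try_insert("INSERT INTO Employees_Projects VALUES (?, ?)", ("EN1-26", "30-457-T3")),
    try_insert("INSERT INTO Employees_Projects VALUES (?, ?)", ("EN1-26", "30-452-T3")),  # duplicate pair
]
print(results)
```

In the multi-field case each field may repeat on its own; only the repeated pair is rejected.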

6. Identify Table Relationships
Based on business rules being modeled
Examples:
  each customer can place many orders
  all employees belong to a department
  each TA is assigned to one course

Historical note: "Relational" as in "Relational Database" has nothing to do with "relationship" as in table relationships. Codd was a mathematician, and devised his rules for modern databases based on mathematical set theory. In set theory, when two groups of numbers have a correspondence of some kind, this is called a relation, and Codd named this type of database relational because the database storage structure follows some of the same rules as mathematical sets, not because we relate tables together.

Relationship Terminology
Relationship Type
  One-to-One: expressed as 1:1
  One-to-Many: expressed as 1:N or 1:M or 1:∞
  Many-to-Many: expressed as N:N or M:M
Primary or Parent Table
  Table on the left side of 1:N relationship
Related or Child Table
  Table on the right side of 1:N relationship
Relational Schema
  Diagram of table relationships in database

Relationship Terminology (cont'd)
Join
  Definition of how related records are returned
Join Line
  Visual relationship indicators in schema
Key fields
  Primary Key: the linking field on the "one" side of a 1:N relationship
  Foreign Key: the primary key from one table that is added to another table so the records can be related
  Non-Key Fields: any field that is not part of a primary key, multi-field primary key, or foreign key

One-to-One (1:1)
Each record in Table A relates to one, and only one, record in Table B, and vice versa.
Either table can be considered the Primary, or Parent, Table
Can usually be combined into one table, although that may not be the most efficient design
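The key and join terminology above can be made concrete with the Customers and Orders tables from "The Fix". This is a sketch that assumes SQLite via Python's `sqlite3` module; the SQL declarations are one possible way to express the schema, not something taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Primary (parent) table: CustomerNum is the primary key.
CREATE TABLE Customers (CustomerNum INTEGER PRIMARY KEY, Customer TEXT NOT NULL);
-- Related (child) table: CustomerNum reappears here as the foreign key.
CREATE TABLE Orders (
    OrderNum    TEXT PRIMARY KEY,
    CustomerNum INTEGER NOT NULL REFERENCES Customers(CustomerNum)
);
INSERT INTO Customers VALUES (7822, 'XYZ Inc.'), (8755, 'Totally Toys'), (9123, 'General Tool Co.');
INSERT INTO Orders VALUES ('09010', 7822), ('06755', 8755), ('08134', 8755),
                          ('07456', 9123), ('08622', 9123);
""")

# The join defines how related records are returned: match foreign key to primary key.
rows = conn.execute("""
    SELECT c.Customer, o.OrderNum
    FROM Customers AS c
    JOIN Orders AS o ON o.CustomerNum = c.CustomerNum
    ORDER BY c.Customer, o.OrderNum
""").fetchall()

for customer, order_num in rows:
    print(customer, order_num)
```

Each customer name is stored once, yet appears beside every one of that customer's orders in the joined result.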

One-to-Many (1:N)
Each record in Table A may relate to zero, one or many records in Table B, but each record in Table B relates to only one record in Table A.
The potential relationship is what's important: there might be no related records, or only one, but there could be many.
The table on the One (or left) side of a 1:N relationship is considered the Primary Table.

Many-to-Many (N:N)
A record in Table A can relate to many records in Table B, and a record in Table B can relate to many records in Table A.
Most RDBMSs do not support N:N relationships directly, requiring the use of a linking (or intersection or bridge) table that breaks the N:N relationship down into two 1:N relationships, with the linking table on the Many side of both new relationships.

Relational Schema
Table 1: Field1_1, Field1_2, Field1_3, Field1_4
Table 2: Field2_1, Field1_1, Field2_2, Field2_3
Join line: Table 1 (1) to Table 2 (N) on Field1_1

7. Normalization
Normal Forms (NF): design standards based on database design theory
Normalization is the process of applying the NFs to table design to eliminate redundancy and create a more efficient organization of DB storage.
Each successive NF applies an increasingly stringent set of rules

Much of what we'll talk about now and much that you've already run into in your own experience will tell you that common sense can avoid many of these problems. At the very least, some of the earlier steps in the design process will obviate or prevent the occurrence of these problems later in the process.

But the normal forms are your safety net. If you aren't sure about whether something belongs in a table or not, run it through the normal forms to find out. Sometimes the problem isn't in the table you're currently analyzing, but in one at which you've already looked.

First Normal Form (1NF)
A table is in first normal form if there are no repeating groups.
Repeating Groups: a set of logically related fields or values that occur multiple times in one record
  1: non-atomic value, or multiple values, stored in a field
  2: multiple fields in the same table that hold logically similar values

Sample 1NF Violation - 1

Employee_Projects_Time
EmployeeID  Name            Project                          Time
EN1-26      Sean O'Brien    30-452-T3, 30-457-T3, 32-244-T3  0.25, 0.40, 0.30
EN1-33      Amy Guya        30-452-T3, 30-328-TC, 32-244-T3  0.05, 0.35, 0.60
EN1-35      Steven Baranco  30-452-T3, 31-238-TC             0.15, 0.80

The problem here is non-atomic values, occurring in different ways. The Name field actually holds two sub-atomic values, First Name and Last Name. Multiple logically similar values are stored in a single field in each record, as with Project and Time. This is an example of one of the two major types of repeating groups or repeating data.

Sample 1NF Violation - 2

Employee_Projects_Time
EmpID   LastName  FirstName  Proj1      Time1  Proj2      Time2
EN1-26  O'Brien   Sean       30-452-T3  0.25   30-457-T3  0.40
EN1-33  Guya      Amy        30-452-T3  0.05   30-328-TC  0.35

The problem here is repeating, logically similar fields, the second major type of repeating data. The table has to expand each time an employee is assigned to a new project. Logically similar fields proliferate as this happens.

Why is this bad? You really don't want the client mucking around in the structure of the database, and you'd prefer not to be called back every time they need to add another project, right?

Fortunately, we can solve this problem by splitting up the table into multiple tables that hold only logically related fields.

Tables in 1NF

Employees
*EmployeeID  LastName  FirstName
EN1-26       O'Brien   Sean
EN1-33       Guya      Amy
EN1-35       Baranco   Steven

Employees_Projects
*ProjNum   *EmployeeID  Time
30-328-TC  EN1-33       0.35
30-452-T3  EN1-26       0.25
30-452-T3  EN1-33       0.05

Second Normal Form (2NF)
A table is in 2NF if it is in 1NF and each non-key field is functionally dependent on the entire primary key.
Functional dependency: a relationship between fields such that the value in one field determines the one value that can be contained in the other field.
Determinant: a field in which the value determines the value in another field.
Example
  Airport  City
  Dulles   Washington, DC

Sample 2NF Violation

Employees_Projects
*EmpID  Lname    Fname  *ProjNum   ProjTitle
EN1-26  O'Brien  Sean   30-452-T3  STAR Manual
EN1-26  O'Brien  Sean   30-457-T3  ISO Procedures
EN1-26  O'Brien  Sean   31-124-T3  Employee Handbook
EN1-33  Guya     Amy    30-452-T3  STAR Manual
EN1-33  Guya     Amy    30-482-TC  Web site

In a well-designed database, the only data that is duplicated is in key fields used to connect tables. In this case, our connecting fields are EmpID and ProjNum. The name fields and the project title field are redundant here; this table should be split into three: Employees, Employees_Projects, and Projects.
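The three-way split just described can be sketched in plain Python. The rows are the sample 2NF violation (with Sean O'Brien's ID written as EN1-26, as in the other sample tables); the helper `project` is a hypothetical name for "keep these columns and drop duplicate rows". The sketch also rejoins the three tables to check that nothing was lost.

```python
# Rows of the 2NF-violating table: (EmployeeID, LastName, FirstName, ProjNum, ProjTitle)
rows = [
    ("EN1-26", "O'Brien", "Sean", "30-452-T3", "STAR Manual"),
    ("EN1-26", "O'Brien", "Sean", "30-457-T3", "ISO Procedures"),
    ("EN1-26", "O'Brien", "Sean", "31-124-T3", "Employee Handbook"),
    ("EN1-33", "Guya", "Amy", "30-452-T3", "STAR Manual"),
    ("EN1-33", "Guya", "Amy", "30-482-TC", "Web site"),
]


def project(rows, *columns):
    """Keep only the given column positions and drop duplicate rows."""
    return sorted({tuple(row[i] for i in columns) for row in rows})


employees = project(rows, 0, 1, 2)        # one row per employee
projects = project(rows, 3, 4)            # one row per project
employees_projects = project(rows, 0, 3)  # linking table: key fields only

# Rejoin through the linking table to confirm the split loses no information.
rejoined = sorted(
    e + p
    for e in employees
    for (emp_id, proj_num) in employees_projects
    for p in projects
    if e[0] == emp_id and p[0] == proj_num
)

print(len(employees), len(projects), len(employees_projects))
print(rejoined == sorted(rows))
```

Names and titles are now stored once each; only the key values repeat, in the linking table that turns the many-to-many relationship into two one-to-many relationships.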
Employees_Projects should hold only those two key fields (remember: no many-to-many relationships).

Tables in 2NF

Employees
*EmployeeID  LastName  FirstName
EN1-26       O'Brien   Sean
EN1-33       Guya      Amy

Employees_Projects
*EmployeeID  *ProjNum
EN1-26       30-452-T3
EN1-33       30-457-T3

Projects
*ProjNum   Title
30-452-T3  STAR manual
30-457-T3  ISO procedure

Third Normal Form (3NF)
A table is in 3NF when it is in 2NF and there are no transitive dependencies.
Transitive Dependency: a type of functional dependency in which the value of a non-key field is determined by the value in another non-key field and that field is not a candidate key.

Sample 3NF Violation

Projects_Managers
*ProjNum   ProjTitle       ProjMgr   Phone
30-452-T3  STAR Manual     Garrison  2756
30-457-T3  ISO Procedures  Jacanda   2954
30-482-TC  Web Site        Friedman  2846
31-124-T3  STAR prototype  Garrison  2756
35-272-TC  Order System    Jacanda   2954

Tables in 3NF

Projects
*ProjNum   ProjTitle       Manager
30-452-T3  STAR manual     Garrison
30-457-T3  ISO procedures  Jacanda

Project Managers
*Manager  Phone
Garrison  2756
Jacanda   2954

Boyce-Codd Normal Form (BCNF)
A table is in BCNF when it is in 3NF and all determinants are candidate keys.
Developed to cover situations that 3NF did not address.
Applies to situations where you have overlapping candidate keys.

Sample Business Rules
Each course can have many students
Each student can take many courses
Each course can have multiple teaching assistants (TAs)
Each TA is associated with only one course
For each course, each student has one TA

Sample BCNF Violation

Course_Students_TAs
CourseNum  Student  TA
ENG101     Jones    Clark
ENG101     Grayson  Chen
ENG101     Samara   Chen
MAT350     Grayson  Powers
MAT350     Jones    O'Shea
MAT350     Berg     Powers

What are the candidate keys? Clearly, there is no single-field primary key, leaving three possibilities:
CourseNum + Student
Student + TA
CourseNum + TA

CourseNum + Student is a candidate key: this combination determines TA.
Student + TA is also a candidate key: because each TA is associated with only one course, the pair determines CourseNum. The trouble is that CourseNum depends on TA alone, not on the whole combination.
CourseNum + TA is not a candidate key: a course and one of its TAs do not identify a single student.

Bingo! We've hit upon the reason BCNF was devised: the candidate keys overlap, and TA is a determinant (it determines CourseNum) that is not itself a candidate key. 3NF does not catch this, because every field here is part of some candidate key.
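The dependencies in the sample table can be checked mechanically. The sketch below tests which field combinations are unique in the sample rows and whether one set of fields determines another; `determines` and `is_unique` are hypothetical helper names. Keep in mind that sample data can only fail to contradict a dependency; the dependencies themselves come from the business rules.

```python
# Sample rows from Course_Students_TAs: (CourseNum, Student, TA)
rows = [
    ("ENG101", "Jones", "Clark"),
    ("ENG101", "Grayson", "Chen"),
    ("ENG101", "Samara", "Chen"),
    ("MAT350", "Grayson", "Powers"),
    ("MAT350", "Jones", "O'Shea"),
    ("MAT350", "Berg", "Powers"),
]
COLUMNS = {"CourseNum": 0, "Student": 1, "TA": 2}


def determines(rows, lhs, rhs):
    """True if equal values in the lhs fields always come with equal rhs values."""
    seen = {}
    for row in rows:
        left = tuple(row[COLUMNS[c]] for c in lhs)
        right = tuple(row[COLUMNS[c]] for c in rhs)
        if seen.setdefault(left, right) != right:
            return False
    return True


def is_unique(rows, fields):
    """True if no two rows share the same values in the given fields."""
    keys = [tuple(row[COLUMNS[c]] for c in fields) for row in rows]
    return len(keys) == len(set(keys))


unique_course_student = is_unique(rows, ["CourseNum", "Student"])
unique_student_ta = is_unique(rows, ["Student", "TA"])
unique_course_ta = is_unique(rows, ["CourseNum", "TA"])
ta_determines_course = determines(rows, ["TA"], ["CourseNum"])
ta_is_unique = is_unique(rows, ["TA"])

print(unique_course_student, unique_student_ta, unique_course_ta)
print(ta_determines_course, ta_is_unique)
```

The last line is the interesting one: TA determines CourseNum, yet TA alone does not identify a row, which is the situation BCNF is meant to rule out.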

Why is this bad? You can't assign a TA to a course until students are enrolled in it (insertion anomaly). And if you changed the name of a TA, you'd have to change it in multiple places (modification anomaly).

Tables in BCNF

Students
*Student  *TA
Jones     Clark
Grayson   Chen

TAs
*CourseNum  *TA
ENG101      Clark
MAT350      Powers

Courses
*CourseNum  *Student
ENG101      Jones
MAT350      Grayson

Fourth Normal Form (4NF)
A table is in 4NF when it is in BCNF and there are no multi-valued dependencies.
Multi-valued Dependency: occurs when, for each value in field A, there is a set of values for field B and a set of values for field C, but B and C are not related.
Occurs when the table contains fields that are not logically related.

Sample 4NF Violation - 1

Movies
*Movie            *Star            *Producer
Once Upon a Time  Judy Garland     Alfred Brown
Once Upon a Time  Mickey Rooney    Alfred Brown
Once Upon a Time  Judy Garland     Muriel Hemingway
Once Upon a Time  Mickey Rooney    Muriel Hemingway
Moonlight         Humphrey Bogart  Alfred Brown
Moonlight         Judy Garland     Alfred Brown

A movie can have more than one star and more than one producer. A star can be in more than one movie, and a producer can produce more than one movie. The primary key would have to include all three fields, so this table would be in BCNF. Still, we have unnecessarily repeated values with ensuing data maintenance problems, and there would be problems with deletion anomalies.

Why? Movie is the determinant for both Star and Producer, but Star and Producer aren't logically related. We're forced to record every possible combination of Movie+Star+Producer in order not to miss anything. That causes repeating values in all fields.

We need a separate table for each of these logical relationships.

Tables in 4NF - 1

Stars
*Movie            *Star
Once Upon a Time  Judy Garland
Once Upon a Time  Mickey Rooney
Moonlight         Humphrey Bogart
Moonlight         Judy Garland

Producers
*Movie            *Producer
Once Upon a Time  Alfred Brown
Once Upon a Time  Muriel Hemingway
Moonlight         Alfred Brown

Sample 4NF Violation - 2

Projects_Equipment
DeptCode  ProjNum    ProjMgrID  Equip               PropID
IS        36-272-TC  EN1-15     CD-ROM              657
IS        Null       Null       VGA monitor         305
AC        36-152-TC  EN1-15     Null                Null
AC        Null       Null       Dot matrix printer  358
AC        Null       Null       Calculator w/tape   239
TW        30-452-T3  EN1-10     486 PC              275
TW        30-457-T3  EN1-15     Null                Null
TW        31-124-T3  EN1-15     Laser Printer       109

Another way to spot a 4NF violation is to look for a pattern of repeating Null values. Here again the fields in the table are not all logically related. And again, the answer is to split this table into two tables which hold only logically related values.

Tables in 4NF - 2

Equipment
*PropID  Equip               DeptCode
657      CD-ROM              IS
305      VGA monitor         IS
358      Dot matrix printer  AC

Projects
*ProjNum   ProjMgrID  DeptCode
36-272-TC  EN1-15     IS
36-152-TC  EN1-15     AC
30-452-T3  EN1-10     TW

One table holds Equipment assigned to Departments and one holds Projects and Departments. Assuming a project can have only one manager and be associated with only one department, the tables should look similar to these.

Beware the business rules, though! A project might involve more than one department or manager and you'd have to figure out a primary key based on those rules. You could then be violating another normal form entirely.

Fifth Normal Form (5NF)
A table is in 5NF when it is in 4NF and there are no cyclic dependencies.
Cyclic Dependency: occurs when there is a multi-field primary key with three or more fields (ex. A, B, C) and those fields are related in pairs AB, BC and AC.
Can occur only with a multi-field primary key of three or more fields

We don't run across this one too often; after all, how many times do you find or need tables with multi-field primary keys consisting of three or more fields?

Let's look at an example.

Sample 5NF Violation

BUYING
*Buyer  *Product  *Company
Chris   Jeans     Levi
Chris   Jeans     Wrangler
Chris   Shirts    Levi
Lori    Jeans     Levi

The primary key here consists of all three fields. The problem is that you have to add a record for every buyer who buys a product for every company that makes that product or they can't buy from them (in the reality of the database, anyway).

Big deal, right?

Do the math
Our sample is two buyers, two products and two companies, so

2 x 2 x 2 = 8 total records

But, what if our store has 20 buyers, 50 products and 100 companies?

20 x 50 x 100 = 100,000 total records

Can you say "Cartesian Product"?

A Tempting Solution

Buyers
*Buyer  *Product
Chris   Jeans
Chris   Shirts
Lori    Jeans

Products
*Product  *Company
Jeans     Wrangler
Jeans     Levi
Shirts    Levi

The problem here occurs when you join these two tables on the Product field: you get a record that is not part of your original data set (it would say that Lori buys jeans from Wrangler).

The Correct Solution

Buyers
*Buyer  *Product
Chris   Jeans
Chris   Shirts
Lori    Jeans

Products
*Product  *Company
Jeans     Wrangler
Jeans     Levi
Shirts    Levi

Companies
*Buyer  *Company
Chris   Wrangler
Chris   Levi
Lori    Levi

Join Buyers to Products by Product, and then join the result with Companies by Buyer and Company, and you get the original data set.

Check the Math, Again
If our company has 20 buyers, 50 products and 100 companies?
Buyers = 20 x 50 = 1000
Products = 50 x 100 = 5000
Companies = 20 x 100 = 2000

8,000 total records instead of 100,000!

I'd much rather maintain 8,000 records and my client probably would, too.
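Both the join behaviour and the arithmetic of the 5NF example can be reproduced in a few lines. This sketch uses Python sets in place of tables; the variable names and the helper `worst_case` are assumptions for illustration, and the data is the BUYING sample.

```python
# The original three-field table: (Buyer, Product, Company)
buying = {
    ("Chris", "Jeans", "Levi"),
    ("Chris", "Jeans", "Wrangler"),
    ("Chris", "Shirts", "Levi"),
    ("Lori", "Jeans", "Levi"),
}

# The three pairwise tables of the correct solution.
buyers = {(b, p) for b, p, c in buying}     # Buyer-Product
products = {(p, c) for b, p, c in buying}   # Product-Company
companies = {(b, c) for b, p, c in buying}  # Buyer-Company

# Tempting solution: join only Buyers and Products on Product.
two_way = {(b, p, c) for b, p in buyers for p2, c in products if p == p2}
spurious = two_way - buying

# Correct solution: also require the Buyer-Company pair to exist.
three_way = {(b, p, c) for b, p, c in two_way if (b, c) in companies}


def worst_case(n_buyers, n_products, n_companies):
    """Maximum record counts: one combined table versus three pairwise tables."""
    combined = n_buyers * n_products * n_companies
    split = n_buyers * n_products + n_products * n_companies + n_buyers * n_companies
    return combined, split


print(spurious)
print(three_way == buying)
print(worst_case(20, 50, 100))
```

The two-table join invents the Lori/Jeans/Wrangler record; filtering through the third table removes it and returns exactly the original four rows.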

8. Finalizing the Design
Double-check to ensure good, principle-based design
Evaluate design in light of business model and determine desired deviations from design principles
  Process efficiency
  Security concerns

That's it for Table Design
Watch for repeating values and fields
Check against the Normal Forms
Make new tables when necessary
Re-check all tables against the NFs
Remember the business rules
Use common sense, but check anyway!

Ensuring Data Integrity
Placing constraints on how and when and where data can be entered

Done after or along with table design

Part of design process because many constraints are established at the database and table levels

Referential Integrity
True relational databases support Referential Integrity: every non-null foreign key value must match an existing primary key value.
In other words, every record in a related table must have a matching record in the primary table.
Preserves the validity of foreign key values.
Enforced at database level.

Why is this important? Referential Integrity helps ensure that the database contains valid and usable values and records by preserving the connection between tables. Without it, table relationships quickly become meaningless and queries return unreliable results. The most common problem in the absence of referential integrity is the creation of orphan records: the primary key value is changed, causing the matching of the related records to fail.

Default in most RDBMSs is for RefInt to be turned off, probably because the software can't tell from the table design whether you want it turned on or not.

So, what happens when you want to change the value on one side of a set of related records? RefInt in its absolute form won't allow this, so
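SQLite is one system where foreign-key enforcement typically has to be switched on per connection, which makes it a convenient place to see the difference. The sketch below is an illustration using Python's `sqlite3` module, not a description of any particular RDBMS from the slides; the helper names and the order number 09999 are made up for the example.

```python
import sqlite3


def build(enforce):
    """Create the Customers/Orders pair with referential integrity on or off."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON" if enforce else "PRAGMA foreign_keys = OFF")
    conn.executescript("""
        CREATE TABLE Customers (CustomerNum INTEGER PRIMARY KEY, Customer TEXT NOT NULL);
        CREATE TABLE Orders (
            OrderNum    TEXT PRIMARY KEY,
            CustomerNum INTEGER NOT NULL REFERENCES Customers(CustomerNum)
        );
        INSERT INTO Customers VALUES (7822, 'XYZ Inc.');
        INSERT INTO Orders VALUES ('09010', 7822);
    """)
    return conn


def orphan_insert_allowed(conn):
    """Try to add an order for customer 1111, who does not exist."""
    try:
        conn.execute("INSERT INTO Orders VALUES ('09999', 1111)")
        return True
    except sqlite3.IntegrityError:
        return False


allowed_without_ri = orphan_insert_allowed(build(enforce=False))
allowed_with_ri = orphan_insert_allowed(build(enforce=True))
print(allowed_without_ri, allowed_with_ri)
```

With enforcement off, the orphan order is stored without complaint; with it on, the database refuses the record.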

Cascading Updates
When a primary key value changes, Cascade Update changes the corresponding values in the related records, so no records get orphaned.
Usually only one level deep
  Foreign key is not usually primary key of related table (except in 1:1 relationships), hence no other tables are usually related to it

Cascade Update only works from the Primary Key (left or primary table) side, because the assumption is that if you are changing the value on that side of the relation, you will want to change it on the other side, too.

Cascade Deletes
When a primary table record is deleted, all matching records in any related table are also deleted
Can propagate through multiple tables if Cascade Delete is turned on in all relationships between those tables
Another protection against orphan records, only this time by eradicating them instead!

Even though most RDBMSs will warn you that a Cascade Update or Delete is about to occur, my personal preference is to always turn on Cascade Update and always leave Cascade Delete turned off. Deleting records can usually be handled programmatically with little more work.

Levels of Enforcement
Referential Integrity is enforced at database level because it affects the relationship between two tables.
Many other business rules are enforced at field and table level to ensure data integrity.
Business rule implementation should be documented: how and where it is enforced in the design.
Some rules can't be enforced at table or field level; they must be enforced at the application level.

Testing of Business Rules
Always test business rule implementation
  What happens when rule is met?
  What happens when rule is violated?
Not much good as a data entry constraint if it doesn't constrain properly
Good application or interface design will provide feedback when user violates a constraint or rule

Field Level Integrity
Constraining by use of field properties
Data type: text, number, Yes/No, Date/Time
Field size
Formats
Entry and editing constraints
  Required
  Indexed, with or without duplicates
  Input masks
Default value
Validation Rule
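Several of the field properties listed above map onto column constraints, and the "test both outcomes" advice is easy to follow in code. This is a sketch assuming SQLite via Python's `sqlite3` module; the column `TimeFraction` (standing in for the Time field of the samples), the 9-character project number rule and the helper `accepted` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employees_Projects (
        EmployeeID   TEXT NOT NULL,                               -- Required
        ProjNum      TEXT NOT NULL CHECK (length(ProjNum) = 9),   -- Field size
        TimeFraction REAL NOT NULL DEFAULT 0.0                    -- Default value
                     CHECK (TimeFraction >= 0 AND TimeFraction <= 1),  -- Validation rule
        PRIMARY KEY (EmployeeID, ProjNum)                         -- Indexed, no duplicates
    )
""")


def accepted(sql, params):
    """Return True if the database accepts the statement, False if a rule rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


full_insert = "INSERT INTO Employees_Projects VALUES (?, ?, ?)"
short_insert = "INSERT INTO Employees_Projects (EmployeeID, ProjNum) VALUES (?, ?)"

rule_met = accepted(full_insert, ("EN1-26", "30-452-T3", 0.25))
default_used = accepted(short_insert, ("EN1-33", "30-452-T3"))
too_much_time = accepted(full_insert, ("EN1-35", "30-452-T3", 1.5))
missing_employee = accepted(full_insert, (None, "30-457-T3", 0.1))
short_proj_num = accepted(full_insert, ("EN1-26", "30-452", 0.1))

default_value = conn.execute(
    "SELECT TimeFraction FROM Employees_Projects WHERE EmployeeID = 'EN1-33'"
).fetchone()[0]

print(rule_met, default_used, too_much_time, missing_employee, short_proj_num, default_value)
```

Each constraint is exercised once with a value that satisfies it and once with a value that violates it, so a rule that silently fails to constrain would show up immediately.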

Table Level Integrity
Field Comparisons
  Compare value in one field to value in another
  Comparison performed before record is saved
  Violations could display an error message or force constraint of available values
Validation or Lookup Tables
  Store generally static set of values
  Stored values used to populate new records to ensure accuracy of data entry

Documentation
A good design deserves good documentation
Data Dictionary for database/table design
  Table and field names
  Table and field properties
  Relationships, including primary and foreign keys
  Indexes
Provide reasons for design features, especially if they intentionally violate normal design principles
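Part of a data dictionary can be generated from the database itself, which keeps the documentation in step with the design. The sketch below assumes SQLite and reads its schema through the `sqlite_master` table and the `table_info` and `foreign_key_list` pragmas; the function name `data_dictionary` and the output layout are choices made for this example. The reasons behind design decisions still have to be written by hand.

```python
import sqlite3


def data_dictionary(conn):
    """Return one text line per table and per field, read from the schema itself."""
    lines = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        lines.append(f"Table: {table}")
        # foreign_key_list rows: id, seq, referenced table, from column, to column, ...
        foreign_keys = {
            row[3]: f"{row[2]}.{row[4]}"
            for row in conn.execute(f'PRAGMA foreign_key_list("{table}")')
        }
        # table_info rows: cid, name, type, notnull, default value, pk position
        for _, name, col_type, notnull, _, pk in conn.execute(f'PRAGMA table_info("{table}")'):
            notes = []
            if pk:
                notes.append("primary key")
            if notnull:
                notes.append("required")
            if name in foreign_keys:
                notes.append(f"foreign key to {foreign_keys[name]}")
            lines.append(f"  {name} ({col_type}): {', '.join(notes) or 'no constraints'}")
    return lines


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerNum INTEGER PRIMARY KEY, Customer TEXT NOT NULL);
    CREATE TABLE Orders (
        OrderNum    TEXT PRIMARY KEY,
        CustomerNum INTEGER NOT NULL REFERENCES Customers(CustomerNum)
    );
""")

dictionary = data_dictionary(conn)
print("\n".join(dictionary))
```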