
The Data Warehouse eBusiness DBA Handbook

Donald K. Burleson, Joseph Hudicka, William H. Inmon, Craig Mullins, Fabian Pascal

The Data Warehouse eBusiness DBA Handbook

By Donald K. Burleson, Joseph Hudicka, William H. Inmon, Craig Mullins, Fabian Pascal

Copyright © 2003 by BMC Software and DBAzine. Used with permission.

Printed in the United States of America.

Series Editor: Donald K. Burleson
Production Manager: John Lavender
Production Editor: Teri Wade
Cover Design: Bryan Hoff

Printing History: August, 2003 for First Edition

Oracle, Oracle7, Oracle8, Oracle8i and Oracle9i are trademarks of Oracle Corporation. Many of the designations used by computer vendors to distinguish their products are claimed as trademarks. All names known to Rampant TechPress to be trademark names appear in this text as initial caps.

The information provided by the authors of this work is believed to be accurate and reliable, but because of the possibility of human error by our authors and staff, BMC Software, DBAZine and Rampant TechPress cannot guarantee the accuracy or completeness of any information included in this work and are not responsible for any errors, omissions or inaccurate results obtained from the use of information or scripts in this work. Links to external sites are subject to change; DBAZine.com, BMC Software and Rampant TechPress do not control or endorse the content of these external web sites, and are not responsible for their content.

ISBN 0-9740716-2-5


Table of Contents

Conventions Used in this Book
About the Authors
Foreword

Chapter 1 - Data Warehousing and eBusiness
    Making the Most of E-business by W. H. Inmon

Chapter 2 - The Benefits of Data Warehousing
    The Data Warehouse Foundation by W. H. Inmon
    References

Chapter 3 - The Value of the Data Warehouse
    The Foundations of E-Business by W. H. Inmon
    Why the Internet?
    Intelligent Messages
    Integration, History and Versatility
    The Value of Historical Data
    Integrated Data
    Looking Smarter

Chapter 4 - The Role of the eDBA
    Logic, e-Business, and the Procedural eDBA by Craig S. Mullins
    The Classic Role of the DBA
    The Trend of Storing Process With Data
    Database Code Objects and e-Business
    Database Code Object Programming Languages
    The Duality of the DBA
    The Role of the Procedural DBA
    Synopsis

Chapter 5 - Building a Solid Information Architecture
    How to Select the Optimal Information Exchange Architecture by Joseph Hudicka
    Introduction
    The Main Variables to Ponder
        Data Volume
        Available System Resources
        Transformation Requirements
        Frequency
    Optimal Architecture Components
    Conclusion

Chapter 6 - Data 101
    Getting Down to Data Basics by Craig S. Mullins
    Data Modeling and Database Design
    Physical Database Design
    The DBA Management Discipline
    The 17 Skills Required of a DBA
    Meeting the Demand

Chapter 7 - Designing Efficient Databases
    Design and the eDBA by Craig S. Mullins
    Living at Web Speed
    Database Design Steps
    Database Design Traps
    Taming the Hostile Database

Chapter 8 - The eBusiness Infrastructure
    E-Business and Infrastructure by W. H. Inmon

Chapter 9 - Conforming to Your Corporate Structure
    Integrating Data in the Web-Based E-Business Environment by W. H. Inmon

Chapter 10 - Building Your Data Warehouse
    The Issues of the E-Business Infrastructure by W. H. Inmon
    Large Volumes of Data
    Performance
    Integration
    Addressing the Issues

Chapter 11 - The Importance of Data Quality Strategy
    Develop a Data Quality Strategy Before Implementing a Data Warehouse by Joseph Hudicka
    Data Quality Problems in the Real World
    Why Data Quality Problems Go Unresolved
    Fraudulent Data Quality Problems
    The Seriousness of Data Quality Problems
    Data Collection
    Solutions for Data Quality Issues
        Option 1: Integrated Data Warehouse
        Option 2: Value Rules
        Option 3: Deferred Validation
    Periodic Sampling Averts Future Disasters
    Conclusion

Chapter 12 - Data Modeling and eBusiness
    Data Modeling for the Data Warehouse by W. H. Inmon
    "Just the Facts, Ma'am"
        Modeling Atomic Data
        Through Data Attributes, Many Classes of Subject Areas Are Accumulated
    Other Possibilities -- Generic Data Models
    Design Continuity from One Iteration of Development to the Next

Chapter 13 - Don't Forget the Customer
    Interacting with the Internet Viewer by W. H. Inmon
    In Summary

Chapter 14 - Getting Smart
    Elasticity and Pricing: Getting Smart by W. H. Inmon
    Historically Speaking
    At the Price Breaking Point
    How Good Are the Numbers
    How Elastic Is the Price
    Conclusion

Chapter 15 - Tools of the Trade: Java
    The eDBA and Java by Craig S. Mullins
    What is Java?
    Why is Java Important to an eDBA?
    How can Java improve availability?
    How Will Java Impact the Job of the eDBA?
    Resistance is Futile
    Conclusion

Chapter 16 - Tools of the Trade: XML
    New Technologies of the eDBA: XML by Craig S. Mullins
    What is XML?
    Some Skepticism
    Integrating XML
    Defining the Future Web

Chapter 17 - Multivalue Database Technology Pros and Cons
    MultiValue Lacks Value by Fabian Pascal
    References

Chapter 18 - Securing your Data
    Data Security Internals by Don Burleson
    Traditional Oracle Security
    Concerns About Role-based Security
    Closing the Back Doors
    Oracle Virtual Private Databases
    Procedure Execution Security
    Conclusion

Chapter 19 - Maintaining Efficiency
    eDBA: Online Database Reorganization by Craig S. Mullins
    Reorganizing Tablespaces
    Online Reorganization
    Synopsis

Chapter 20 - The Highly Available Database
    The eDBA and Data Availability by Craig S. Mullins
    The First Important Issue is Availability
    What is Implied by e-vailability?
    The Impact of Downtime on an e-business
    Conclusion

Chapter 21 - eDatabase Recovery Strategy
    The eDBA and Recovery by Craig S. Mullins
    eDatabase Recovery Strategies
    Recovery-To-Current
    Point-in-Time Recovery
    Transaction Recovery
    Choosing the Optimum Recovery Strategy
    Database Design
    Reducing the Risk

Chapter 22 - Automating eDBA Tasks
    Intelligent Automation of DBA Tasks by Craig S. Mullins
    Duties of the DBA
    A Lot of Effort
    Intelligent Automation
    Synopsis

Chapter 23 - Where to Turn for Help
    Online Resources of the eDBA by Craig S. Mullins
    Usenet Newsgroups
    Mailing Lists
    Websites and Portals
    No eDBA Is an Island


Conventions Used in this Book

It is critical for any technical publication to follow rigorous standards and employ consistent punctuation conventions to make the text easy to read. However, this is not an easy task. Within Oracle there are many types of notation that can confuse a reader. Some Oracle utilities such as STATSPACK and TKPROF are always spelled in CAPITAL letters, while Oracle parameters and procedures have varying naming conventions in the Oracle documentation. It is also important to remember that many Oracle commands are case sensitive, and are always left in their original executable form, never altered with italics or capitalization. Hence, all Rampant TechPress books follow these conventions:

Parameters - All Oracle parameters will be lowercase italics.

Exceptions to this rule are parameter arguments that are commonly capitalized (KEEP pool, TKPROF); these will be left in ALL CAPS.

Variables – All PL/SQL program variables and arguments will also remain in lowercase italics (dbms_job, dbms_utility).

Tables & dictionary objects – All data dictionary objects are referenced in lowercase italics (dba_indexes, v$sql). This includes all v$ and x$ views (x$kcbcbh, v$parameter) and dictionary views (dba_tables, user_indexes).

SQL – All SQL is formatted for easy use in the code depot, and all SQL is displayed in lowercase. The main SQL terms (select, from, where, group by, order by, having) will always appear on a separate line.


Programs & Products – All products and programs that are known to the author are capitalized according to the vendor specifications (IBM, DBXray, etc). All names known by Rampant TechPress to be trademark names appear in this text as initial caps. References to UNIX are always made in uppercase.


About the Authors

Bill Inmon is universally recognized as the "father of the data warehouse." He has more than 26 years of database technology management experience and data warehouse design expertise, and has published 36 books and more than 350 articles in major computer journals. He is known globally for his seminars on developing data warehouses and has been a keynote speaker for many major computing associations. Inmon has consulted with a large number of Fortune 1000 clients, offering data warehouse design and database management services. For more information, visit www.BillInmon.com or call (303) 221-4000.

Joseph Hudicka is the founder of the Information Architecture Team, an organization that specializes in data quality, data migration, and ETL. Winner of the ODTUG Best Speaker award for the Spring 1999 conference, Joseph is an internationally recognized speaker at ODTUG, OOW, IOUG-A, TDWI and many local user groups. Joseph coauthored Oracle8 Design Using UML Object Modeling for Osborne/McGraw-Hill & Oracle Press, and has also written or contributed to several articles for publication in DMReview, Intelligent Enterprise and The Data Warehousing Institute (TDWI).

Craig S. Mullins is a director of technology planning for BMC Software. He has over 15 years of experience dealing with data and database technologies. He is the author of the book DB2 Developer's Guide (now available in a fourth edition that covers up to and includes the latest release of DB2, Version 6) and is working on a book about database administration practices (to be published this year by Addison Wesley).


Craig can be reached via his Website at www.craigsmullins.com or at [email protected].

Fabian Pascal has a national and international reputation as an independent technology analyst, consultant, author and lecturer specializing in data management. He was affiliated with Codd & Date and for 20 years held various analytical and management positions in the private and public sectors, has taught and lectured at the business and academic levels, and advised vendor and user organizations on data management technology, strategy and implementation. Clients include IBM, Census Bureau, CIA, Apple, Borland, Cognos, UCSF, and IRS. He is founder, editor and publisher of DATABASE DEBUNKINGS (http://www.dbdebunk.com/), a Web site dedicated to dispelling persistent fallacies, flaws, myths and misconceptions prevalent in the IT industry (Chris Date is a senior contributor). Author of three books, he has published extensively in most trade publications, including DM Review, Database Programming and Design, DBMS, Byte, Infoworld and Computerworld. He is the author of the contrarian columns Against the Grain and Setting Matters Straight, and writes for The Journal of Conceptual Modeling. His third book, Practical Issues in Database Management, serves as the text for his seminars.


Foreword

With the advent of cheap disk I/O subsystems, it is finally possible for database professionals to have databases store multiple billions and even multiple trillions of bytes of information. As the size of these databases increases to behemoth proportions, it is the challenge of the database professionals to understand the correct techniques for loading, maintaining, and extracting information from very large database management systems. The advent of cheap disks has also led to an explosion in business technology, where even the most modest financial investment can bring forth an online system with many billions of bytes. It is imperative that the business manager understand how to manage and control large volumes of information while at the same time providing the consumer with high-volume throughput and sub-second response time.

This book provides you with insight into how to build the foundation of your eBusiness application. You'll learn the importance of the data warehouse in your daily operations. You'll gain lots of insight into how to properly design and build your information architecture to handle the rapid growth that eCommerce business sees today. Once your system is up and running, it must be maintained, and this text covers how to maintain online data systems to reduce downtime. Keeping your online data secure is another big issue with online business. To wrap things up, you'll get links to some of the best online resources on data warehousing.

The purpose of this book is to give you significant insights into how you can manage and control large volumes of data. As the technology has expanded to support terabyte data capacity, the challenge to the database professionals is to understand effective techniques for the loading and maintaining of these very large database systems. This book brings together some of the world's foremost authors on data warehousing in order to provide you with the insights that you need to be successful in your data warehousing endeavors.


Chapter 1 - Data Warehousing and eBusiness

Making the Most of E-business

Everywhere you look today, you see e-business. In the trade journals. On TV. In the Wall Street Journal. Everywhere. And the message is that if your business is not e-business enabled, you will be behind the curve. So what is all the fuss about?

Behind the corporate push to get into e-business is a Web site. Or multiple Web sites. The Web site allows your corporation to have a reach into the marketplace that is direct and far-reaching. Businesses that would never have entertained entry to foreign marketplaces and other hard-to-access marketplaces suddenly have an easy and cheap presence. In a word, e-business opens up possibilities that previously were impractical or even impossible.

So the secret to e-business is a Web site. Right? Well, almost. Indeed, a Web site is a wonderful delivery mechanism. The Web site allows you to go where you might never have been able to go before. But after all is said and done, a Web site is merely a delivery mechanism. To be effective, the delivery mechanism must be allied with strong business propositions. There is a way of expressing this: opportunity = delivery mechanism + business proposition.


Figure 1: The web site is at the heart of e-Business

To illustrate the limitations of a Web site, consider the personal Web sites that many people have created. If there were any inherent business advantage to having a Web site, then these personal sites would be achieving business results for their owners. But no one thinks that just putting up a Web site produces results. It is what you do with the Web site that counts.

To exploit the delivery mechanism that is the Web environment, applications are necessary. There are many kinds of applications that can be adapted to the Web environment. But the most potent, most promising applications are a class called Customer Relationship Management (CRM) applications. CRM applications have the capability of producing very important business results. Executed properly, CRM applications:

- protect market share
- gain new market share
- increase revenues
- increase profits


And there's not a business around that doesn't want to do these things. So what kind of applications are we talking about here? There are many different flavors. Typical CRM applications include:

- yield management
- customer retention
- customer segmentation
- cross selling
- up selling
- household selling
- affinity analysis
- market basket analysis
- fraud detection
- credit scoring, and so forth

In short, there are many different ways that applications can be created to absolutely maximize the effectiveness of the Web. Stated differently, without these applications, the Web environment is just another Web site. And there are other related non-CRM applications that can improve the bottom line of business as well. These applications include:

- quality control
- profitability analysis
- destination analysis (for airlines)
- purchasing consolidation, and the like


In short, once the Web is enabled by supporting applications, then very real business advantage occurs. But applications do not just happen by themselves. Applications such as CRM and others are built on a foundation of data called a data warehouse. The data warehouse is at the center of an infrastructure called the "corporate information factory." Figure 2 shows the corporate information factory and the Web environment.

Figure 2: Sitting behind the web site is the infrastructure called the "corporate information factory"

Figure 2 shows that the Web environment serves as a conduit into the corporate information factory. The corporate information factory provides a variety of important functions for the Web environment:


- the corporate information factory enables the Web environment to gather and manage an unlimited amount of data
- the corporate information factory creates an environment where sweeping business patterns can be detected and analyzed
- the corporate information factory provides a place where Web-based data can be integrated with other corporate data
- the corporate information factory makes edited and integrated data quickly available to the Web environment, and so forth

In a word, the corporate information factory provides the background infrastructure that turns the Web from a delivery mechanism into a truly powerful tool. The different components of the corporate information factory are:

- the data warehouse
- the corporate ODS
- data marts
- the exploration warehouse
- alternative/near-line storage

The heart of the corporate information factory is the data warehouse. The data warehouse is a structure that contains:

- detailed, granular data
- integrated data
- historical data
- corporate data


A convenient way to think of the data warehouse is as a structure that contains very fine grains of sand. Different applications take those grains of sand and reshape them into the form and structure that is most familiar to the organization.

One of the issues that frequently arises with applications for the Web is whether it is necessary to have a data warehouse in support of the applications. Strictly speaking, it is not necessary to have a data warehouse in support of the applications that run on the Web. Figure 3 shows that different applications have been built from the legacy foundation.

Figure 3: Building applications without a data warehouse


In Figure 3, multiple applications have been built from the same supporting applications. Looking at Figure 3, it becomes clear that the same processing -- accessing data, gathering data, editing data, cleansing data, merging data and integrating data -- is done for every application. Almost all of the processing shown is redundant. There is no need for every application to repeat what every other application has done. Figure 4 shows that by building a data warehouse, the repetitive activities are done just once.

Figure 4: Building a data warehouse for the different applications


In Figure 4, the infrastructure activities of accessing data, gathering data, editing data, cleansing data, merging data and integrating data are done once. The savings are obvious. But there are some other powerful reasons why building a data warehouse makes sense:

- when it comes time to build a new application, with a data warehouse in place the application can be constructed quickly; with no data warehouse in place, the infrastructure has to be built again
- if there is a discrepancy in values, with a data warehouse those values can be resolved easily and quickly
- the resources required for access of legacy data are minimal when there is a data warehouse; when there is no data warehouse, the resources required for the access of legacy data grow with each new application, and so forth

In short, when an organization takes a long-term perspective, the data warehouse at the center of the corporate information factory is the only way to fly. It is intuitively obvious that a foundation of integrated historical granular data is useful for competitive advantage. But one step beyond intuition, the question must be asked: exactly how can integrated historical data be turned into competitive advantage? It is the purpose of the articles that follow to explain how integrated historical data can be turned into competitive advantage and how that competitive advantage can be delivered through the Web.


Chapter 2 - The Benefits of Data Warehousing

The Data Warehouse Foundation

The Web-based e-business environment has tremendous potential. The Web is a tremendously powerful medium for delivery of information. But there is nothing intrinsically powerful about the Web other than its ability to deliver information. In order for the Web-based e-business environment to deliver its full potential, the Web-based environment requires an infrastructure in support of its information processing needs. The infrastructure that best supports the Web is called the corporate information factory. At the center of the corporate information factory is a data warehouse. Fig 1 shows the basic infrastructure supporting the Web-based e-business environment.


Figure 1: the web environment and the supporting infrastructure

The heart of the corporate information factory is the data warehouse. The data warehouse is the place where corporate granular integrated historical data resides. The data warehouse serves many functions, but the most important function it serves is that of making information available cheaply and quickly. Stated differently, without a data warehouse the cost of information goes sky high and the length of time required to get information is exceedingly long. If the Web-based e-business environment is to be successful, it is necessary to have information that is cheap to access and immediately available. How does the data warehouse lower the cost of getting information? And how does the data warehouse greatly accelerate the speed with which information is available? These


issues are not immediately obvious when looking at the structure of the corporate information factory. In order to explain how the data warehouse accomplishes its important functions, consider the seemingly innocent request for information in a manufacturing environment where there is no data warehouse. A financial analyst wants to find out what corporate sales were for the last quarter. Is this a reasonable request for information? Absolutely. Now, what is required to get that information?

Figure 2: getting information from applications

Fig 2 shows that many different sources have to be accessed to get the desired information. Some of the data is in IMS; some is in VSAM. Yet other files are in ADABAS. The key structure of the European file is different from the key structure of the Asian file. The parts data uses different closing dates than the truck data. The body design for cars is called one thing in the cars file and another thing in the parts file. To get the required information takes lots of analysis, access to 10 programs and the ability to integrate the data. Moreover, it takes six months to deliver the information -- at a cost of $250,000.


These numbers are typical for a mid-sized to large corporation. In some cases these numbers are very much understated. But the real issue isn't the costs and length of time required for accessing data. The real issue is how many resources are needed for accessing many units of information. Fig 3 shows that seven different types of information have been requested.

Figure 3: getting information from applications for seven different reports

The costs that were described for Fig 2 are now multiplied by seven (or whatever number of units of data are required). As the analyst is developing the procedures for getting the unit of information required, no thought is given to getting information for other units of information. Therefore, each time a new piece of information is required, the process described in Fig 2 begins all over again. As a result, the cost of information spikes dramatically.

But suppose, for example, that this organization had a data warehouse. And suppose the organization had a request for seven units of information. What would it cost to get that information and how long would it take? Fig 4 illustrates this scenario.

Figure 4: making a report from a data warehouse

Once the data warehouse is built, it can serve multiple requests for information. The granular integrated data that resides in the data warehouse is ideal for being shaped and reshaped. One analyst can look at the data one way; another analyst can look at the same data in yet another way. And you only have to create the infrastructure once. The financial analyst may spend 30 minutes tracking down a unit of data, such as consolidated sales. Or if the data is difficult to calculate, it may take a day to get the job done. Depending on the complexity and how costs are calculated, it may cost between $100 and $1,000 to access the data. Compare that price range to what it might cost at an organization with no data warehouse, and it becomes obvious why a data warehouse makes data available quickly and cheaply.

Of course the real difference between having a data warehouse and not having one lies in not having to build the infrastructure required for accessing the data. With a data warehouse, you build the infrastructure only once. With no data warehouse, you have to build at least part of the infrastructure every time you want new data. In reality, however, no company goes looking for just one piece of data. In fact, it's quite the opposite - most companies require many forms of data. And the need for new forms and structures of data is recreated every day. When it comes to looking at the larger picture - not the cost of data for a single item, but the cost of data for all data - the data warehouse greatly eases the burden placed on the information systems organization. Fig 5 shows the difference between having a data warehouse and not having a data warehouse in the case of finding multiple types of data.
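To put rough numbers on that difference using the illustrative figures quoted in this chapter (the article's own example numbers, not measurements): seven one-off requests at roughly $250,000 apiece come to about $1,750,000 and years of cumulative elapsed time, while seven requests against an existing warehouse at $100 to $1,000 apiece come to at most about $7,000, plus the one-time cost of building the warehouse itself.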


Figure 5: making seven reports from a data warehouse

Looking at Fig 5, it's obvious that a data warehouse really does lower the cost of getting information and greatly accelerates the rate at which data can be found. But organizations have a habit of not looking at the big picture, preferring instead to focus on immediate needs. They look only up to next Tuesday and not an hour beyond it. What do short-sighted organizations see? The comparison between the data warehouse infrastructure and the need for a single unit of information. Fig 6 shows this comparison.


Figure 6: when all you are looking at is a single report, it appears to be less expensive to get it from the applications directly and not build a data warehouse

When looking at the diagram in Fig 6, the short-term approach of not building a data warehouse is attractive. The organization thinks only of the quick fix. And in the very short term, it is less expensive just to dive in and get data from applications without building a data warehouse. There are a hundred excuses the corporation has for not looking to the long term:

- The data warehouse is so big
- We heard that data warehouses don't really work
- All we need is some quick and dirty information
- I don't have time to build a data warehouse
- If I build a data warehouse and pay for it, one of my neighbors will use the data later on and they don't have to pay for it, and so forth.

As long as a corporation insists on having nothing but a short-term focus, it will never build a data warehouse. But the minute the corporation takes a long-term look, the future becomes an entirely different picture. Fig 7 shows the long-term focus.


Figure 7: when you look at the larger picture you see that building a data warehouse saves huge amounts of resources

Fig 7 shows that when the long-term needs for information are considered, the data warehouse is far and away less expensive than the series of short-term efforts. And the length of time for access to information is an intangible whose worth is difficult to measure. No one argues that information today, right now, is much more effective than information six months from now. In fact, six months from now I will have forgotten why I wanted the information in the first place. You simply cannot beat a data warehouse for speed and ease of access to information.

The Web environment, then, is a most promising environment. But in order to unlock the potential of the Web, information must be freely and cheaply available. The supporting infrastructure of the data warehouse provides that foundation and is at the heart of the effectiveness of the Web environment.


References

Inmon, W. H. - The Corporate Information Factory, 2nd edition, John Wiley, NY, NY, 2000
Inmon, W. H. - Building the Data Warehouse, 2nd edition, John Wiley, NY, NY, 1998
Inmon, W. H. - Building the Operational Data Store, 2nd edition, John Wiley, NY, NY, 1999
Inmon, W. H. - Exploration Warehousing, John Wiley, NY, NY, 2000
Website - www.BILLINMON.COM, a site containing useful information about architecture, data models, articles, presentations, white papers, near-line storage, exploration warehousing, methodologies and other important topics.


Chapter 3 - The Value of the Data Warehouse

The Foundations of E-Business

The basis for a long-term, sound e-business competitive advantage is the data warehouse.

Why the Internet?

Consider the Internet. When you get down to it, what is the Internet good for? It is good for connectivity, and with connectivity comes opportunity - the opportunity to sell somebody something, to help someone, to get a message across. But at the same time, connectivity is ALL the Internet provides. In order to take advantage of that connectivity, the real competitive advantage is found in the content and presentation of the messages that are passed along the lines of connectivity.

Consider the telephone. Before the advent of the telephone, getting a message to someone was accomplished by mail or shouting. Then when the telephone appeared, it was possible to have cheap and instant access to someone. But merely making a quick call becomes a trite act. The important thing about making a telephone call quickly is what you say to the person, not the fact that you did it cheaply and quickly. The message delivered over the phone becomes the essence, not the phone itself. With the phone you can:


- ask your girlfriend out for Saturday night
- tell the county you aren't available for jury duty
- call in sick for work and go play golf
- find out if it had snowed in Aspen last night
- call the doctor, and so forth.

The real value of the phone is the communication of the message. The same is true of the Internet. Today, people are enamored of the novelty of the ability to communicate instantaneously. But where commercial advantage is concerned, the real value of the Internet lies in the messages that are passed through cyberspace, not in the novelty of the passage itself.

Intelligent Messages

To give your messages sent via the Internet some punch, you need intelligence behind them. And the basis of that intelligence is the information that is buried in a data warehouse. Why is the data warehouse the basis of business intelligence? Simple. With a data warehouse, you have two facets of information that have otherwise not been available: integration and history.

In years past, application systems have been built in which each application considered only its own set of requirements. One application thought of a customer as one thing, another application thought of a customer as something else. There was no integration - no cohesive understanding of information - from one application to the next.


And the applications of yesterday paid no mind to history. The applications of yesterday looked only at what was happening right now. Ask a bank what your bank account balance is today and they can tell you. But ask them what your average balance has been over the past twelve months and they have no idea.

Integration, History and Versatility

The essence of data warehousing is integration and history. Integration is achieved by the messy task of going back into older legacy systems and pulling out data that was a by-product of transaction processing, and converting and integrating that data. Integrating old legacy data is a dirty, thankless task that nobody wants to undertake, but the rewards of integration are worth the time and effort. Historical data is achieved by organizing and collecting the integrated data over time. Data is time-stamped and stored at the detailed level. Once an organization has a carefully crafted collection of integrated detailed historical data, it is in a position of great strength.

The first real value of the collection of data - a data warehouse - is the versatility of the data. The data can be organized a certain way on one day and another way the next. Marketing can look at customers by state or by month, Sales can look at sales transactions per day, and Accounting can look at closed business by country or by quarter - all from the same store of data. A top manager can walk in at 8:00 a.m. and decide that he or she wants to look at the world in a manner no one else has thought of, and the integrated, detailed historical data will allow that to happen. Done properly, the manager can have his or her report by 5:00 p.m. that same afternoon.

So the first tremendous business value that a data warehouse brings is the ability to look at data any way that is useful. But looking at data internally doesn't really have anything to do with e-business or the Internet. And the data warehouse has tremendous advantages there. How do the Internet and the data warehouse work together to produce a business advantage? The Internet provides connectivity and the data warehouse produces continuity.

The Value of Historical Data

Consider the value of historical data when it comes to understanding a customer. When you have historical data about customers, you have the key to understanding their future behavior. Why? Because people are creatures of habit with predictable life patterns. The habits that we form early in our life stick with us throughout our life. The clothes we wear, the place we live, the food we eat, the cars we drive, how we pay our bills, how we invest, where we go on vacation - all of these features are set early in our adulthood. Understanding a customer's past history then becomes a tremendous predictor of the future.

Customers are subject to patterns. In our youth, most of us don't have much money to invest. But as we get older, we have more disposable income. At mid-life, our children start looking for colleges. At late mid-life, we start thinking about retirement. In short, there are predictable patterns of behavior that practically everyone experiences. Knowing the history of your customer allows you to predict what the next pattern of behavior will be.

What happens when you can predict your customer's behavior? Basically, you're in a position to package products and tailor them to your customers. Having historical data that resides in a data warehouse lets you do exactly that. Through the Internet, you reach the customer. Then, the data warehouse tells you what to say to the customer to get his or her attention. The information in the data warehouse allows you to craft a message that your customer wants to hear.

Integrated Data

Integrated data has a related but different effect. Suppose you are a salesperson wanting to sell something (it really doesn't matter what). Your boss gives you a list and says go to it. Here's your list:

- acct 123
- acct 234
- acct 345
- acct 456
- acct 567
- acct 678

You start by making a few contacts, but you find that you're not having much success. Most everyone on your list isn't interested in what you're selling. Now somebody suggests that you get a little integrated data. You don't know exactly what that is, but anything is better than beating your head against a wall. So now you have a list of very basic integrated data:

- acct 123 - John Smith - male
- acct 234 - Mary Jones - female
- acct 345 - Tom Watson - male
- acct 456 - Chris Ng - female
- acct 567 - Pat Wilson - male
- acct 678 - Sam Freed - female

This simple integrated data makes your life as a salesperson a little simpler. You know not to sell bras to a male or cigars to a female (or at least not to most females). Your sales productivity improves. Then someone suggests that you get some more integrated data. So you do. Anything beats trying to sell something blind. Here's how your list looks with even more integrated data:

- acct 123 - John Smith - male - 25 years old - single
- acct 234 - Mary Jones - female - 58 years old - widow
- acct 345 - Tom Watson - male - 52 years old - married
- acct 456 - Chris Ng - female - 18 years old - single
- acct 567 - Pat Wilson - male - 68 years old - married
- acct 678 - Sam Freed - female - 45 years old - married

Now we are getting somewhere. With age and marital status, you can be a lot smarter about choosing what you sell and to whom. For example, you probably don't want to sell a life insurance policy to Chris Ng because she is 18 and single and unlikely to buy a life insurance policy. But Sam Freed is a good bet. With integrated data, the sales process becomes a much smoother one. And you don't waste time trying to sell something to someone who probably won't buy it. So integrated data is a real help in knowing who you are dealing with. Now you decide you want even more integrated data:

acct 123 - John Smith - male - 25 years old - single - profession - accountant - income - 35,000 - no family

acct 234 - Mary Jones - female - 58 years old - widow - profession - teacher - income - 40,000 - daughter and two sons

acct 345 - Tom Watson - male - 52 years old - married - profession - doctor - income - 250,000 - son and daughter

acct 456 - Chris Ng - female - 18 years old - single - profession - hair dresser - income - 18,000 - no family


acct 567 - Pat Wilson - male - 68 years old - married - profession - retired - income - 25,000 - two sons

acct 678 - Sam Freed - female - 45 years old - married - profession - pilot - income - 150,000 - son and daughter

With the new infusion of integrated information, the salesperson can start to be very scientific about who to target. Trying to sell a new Ferrari to Pat Wilson is not likely to produce any good results at all. Pat simply does not have the income to warrant such a purchase. But trying to sell the Ferrari to Sam Freed or Tom Watson may produce some results because they can afford it. Adding even more integrated information produces the following results:

acct 123 - John Smith - male - 25 years old - single - profession - accountant - income - 35,000 - no family - owns home - net worth - 15,000 - drives - Ford - school - CU - degree - BS - hobbies - golf

acct 234 - Mary Jones - female - 58 years old - widow - profession - teacher - income - 40,000 - daughter and two sons - rents - net worth - 250,000 - drives - Chevrolet - school - NMSU - degree - BS - hobbies - mountain climbing

acct 345 - Tom Watson - male - 52 years old - married - profession - doctor - income - 250,000 - son and daughter - owns home - net worth - 3,000,000 - drives - Mercedes - school - Yale - degree - MBA - hobbies - stamp collecting


acct 456 - Chris Ng - female - 18 years old - single - profession - hair dresser - income - 18,000 - no family - rents - net worth - 0 - drives - Honda - school - none - degree - none - hobbies - hiking, tennis

acct 567 - Pat Wilson - male - 68 years old - married - profession - retired - income - 25,000 - two sons - rents - net worth - 25,000 - drives - nothing - school - U Texas - degree - PhD - hobbies - watching football

acct 678 - Sam Freed - female - 45 years old - married - profession - pilot - income - 150,000 - son and daughter - owns home - net worth - 750,000 - drives - Toyota - school - UCLA - degree - BS - hobbies - thimble collecting

Now the salesperson is armed with even more information. Qualifying who will be a prospect to buy is now a reasonable task. More to the point, knowing who you are talking to on the Internet is no longer a hit-or-miss proposition. You can start to be very accurate about what you say and what you offer. Your message across the Internet becomes a lot more cogent.
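To see how integrated attributes turn into a qualification query, imagine the list above loaded into a single customer table in the warehouse. A minimal sketch, with hypothetical table and column names, following the SQL conventions of this book:

select
   acct_no,
   cust_name,
   income,
   net_worth
from
   customer
where
   income >= 100000
   and owns_home = 'Y'
order by
   net_worth desc;

One analyst might run this to shortlist luxury-car prospects; another could filter the same table on age and hobbies for a tennis promotion. That is exactly the reshaping of the same granular data described earlier.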

Looking Smarter

Stated differently, with integrated data you can be a great deal more accurate and efficient in your sales efforts. Integrated data saves huge amounts of time that would otherwise be wasted. With integrated customer data, your Internet messages start to make you look smart.

But making sales isn't the only use for integrated information. Marketing can also make great use of this information. It probably doesn't make sense, for example, to market tennis equipment to Sam Freed. Chris Ng is a much better bet for that. And it probably doesn't make sense to market football jerseys to Tom Watson. Instead, marketing those things to Pat Wilson makes the most sense. Integrated information is worth its weight in gold when it comes to not wasting marketing dollars and opportunities.

The essence of the data warehouse is historical data and integrated data. When the euphoria and the novelty of being able to communicate with someone via the Internet wears off, the fact remains that the message being communicated is much more important than the means. To create meaningful messages, the content of the data warehouse is ideal for commercial purposes.


Chapter 4 - The Role of the eDBA

Logic, e-Business, and the Procedural eDBA

Until recently, the domain of a database management system was, appropriately enough, to store, manage, and access data. Although these core capabilities are still required of a modern DBMS, additional procedural functionality is becoming not just a nice-to-have feature, but a necessity. A modern DBMS provides the ability to define business rules in the DBMS itself instead of in a separate application program. Specifically, all of the most popular RDBMS products support an array of complex features and components to facilitate procedural logic.

Procedural DBMS facilities are being driven by organizations as they move to become e-businesses. As the DBMS adapts to support more procedural capabilities, organizations must modify and expand the way they handle database management and administration. Typically, as new features are added, the administration, design, and management of these features are assigned to the database administrator (DBA) by default. Simply dumping these new administrative burdens on the already overworked DBA staff may not be the best approach. But "DBA-like duties" are required to effectively manage these procedural elements.

The Classic Role of the DBA

Every database programmer has their favorite "curmudgeon DBA" story. You know, those famous anecdotes that begin with "I have a problem..." and end with "...and then he told me to stop bothering him and read the manual." DBAs simply do not have a "warm and fuzzy" image. This probably has more to do with the nature and scope of the job than anything else. The DBMS spans the enterprise, effectively placing the DBA on call for the applications of the entire organization.

To make matters worse, the role of the DBA has expanded over the years. In the pre-relational days, both database design and data access were complex. Programmers were required to code program logic to navigate through the database and access data. Typically, the pre-relational DBA was assigned the task of designing the hierarchical or network database, a process that usually consisted of both logical and physical database design, although it was not always recognized as such at the time. After the database was designed and created, and the DBA created backup and recovery jobs, little more than space management and reorganizations were required. I do not want to belittle these tasks. Pre-relational DBMS products (such as IMS and IDMS) require a complex series of utility programs to be run in order to perform backup, recovery, and reorganization. This can consume a large amount of time, energy, and effort.

As RDBMS products gained popularity, the role of the DBA expanded. Of course, DBAs still designed databases, but increasingly these were generated from logical data models created by data administrators and data modelers. Now the DBA has become involved in true logical design and must be able to translate a logical design into a physical database implementation. Relational database design still requires physical implementation decisions such as indexing, denormalization, and partitioning schemes. But, instead of merely concerning themselves with physical implementation and administration issues, relational DBAs must become more intimately involved with procedural data access. This is so because the RDBMS creates data access paths. As such, the DBA must become more involved in the programming of data access routines. No longer are programmers navigating through data; now the RDBMS does that. Optimizer technology embedded in the RDBMS is responsible for creating the access paths to the data. And these optimization choices must be reviewed - usually by the DBA. Program and SQL design reviews are now a vital component of the DBA's job.

Furthermore, the DBA must tackle additional monitoring and tuning responsibilities. Backup, recover, and REORG are just a start. Now, DBAs use EXPLAIN, performance monitors, and SQL analysis tools to proactively administer RDBMS applications. Oftentimes, DBAs are not adequately trained in these areas. Programming is a distinctly different skill from creating well-designed relational databases. DBAs must understand application logic and programming techniques to succeed. And now the role of the DBA expands even further with the introduction of database procedural logic.
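To make the point about access-path reviews concrete, an Oracle DBA might examine a statement's access path with something along these lines (a minimal sketch; the orders table is hypothetical, and the dbms_xplan package assumed here is available in newer Oracle releases, while older releases query plan_table directly):

explain plan for
select
   cust_id,
   order_total
from
   orders
where
   order_date > sysdate - 30;

select *
from
   table(dbms_xplan.display);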

The Trend of Storing Process With Data

Today's modern RDBMS stores procedural logic in the database, further complicating the job of the DBA. The popular RDBMSs of today support database-administered procedural logic in the form of stored procedures, triggers, and user-defined functions (UDFs).


Stored procedures can be thought of as programs that are maintained, administered, and executed through the RDBMS. The primary reason for using stored procedures is to move application code off of a client workstation and on to the database server to reduce overhead. A client can invoke the stored procedure and then the procedure invokes multiple SQL statements. This is preferable to the client executing multiple SQL statements directly because it minimizes network traffic, thereby enhancing performance. A stored procedure can access and/or modify data in one or more tables. Basically, stored procedures work like "programs" that "live" in the RDBMS.

Triggers are event-driven specialized procedures that are stored in, and executed by, the RDBMS. Each trigger is attached to a single, specified table. Triggers can be thought of as an advanced form of "rule" or "constraint" written using procedural logic. A trigger cannot be directly called or executed; it is automatically executed (or "fired") by the RDBMS as the result of an action - usually a data modification to the associated table. Once a trigger is created, it is always executed when its "firing" event occurs (update, insert, delete, time, etc.).

A user-defined function, or UDF, is procedural code that works within the context of SQL statements. Each UDF provides a result based on a set of input values. UDFs are programs that can be executed in place of standard, built-in SQL scalar or column functions. A scalar function transforms data for each row of a result set; a column function evaluates each value for a particular column in each row of the result set and returns a single value. Once written, and defined to the RDBMS, a UDF can be used in SQL statements just like any other built-in function.
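As a rough Oracle PL/SQL illustration of the three kinds of database code objects (the table, column, and object names below are hypothetical, and error handling is omitted):

-- Stored procedure: several SQL statements behind a single call
create or replace procedure close_order (p_order_id in number) as
begin
   update orders
      set status = 'CLOSED'
    where order_id = p_order_id;

   insert into order_audit (order_id, action, action_date)
   values (p_order_id, 'CLOSE', sysdate);
end;
/

-- Trigger: fired automatically by a data modification
create or replace trigger orders_upd_trg
   before update on orders
   for each row
begin
   :new.last_changed := sysdate;
end;
/

-- User-defined function: callable from within SQL like a built-in function
create or replace function order_age_days (p_order_date in date)
   return number as
begin
   return trunc(sysdate - p_order_date);
end;
/

select
   order_id,
   order_age_days(order_date)
from
   orders;

A client that calls close_order makes one round trip instead of issuing two separate SQL statements, which is the network-traffic saving described above.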


Stored procedures, triggers, and UDFs are just like other database objects such as tables, views, and indexes, in that they are controlled by the DBMS. These objects are often collectively referred to as database code objects, or DBCOs, because they are actually program code that is stored and maintained by a database server as a database object. Depending on the particular RDBMS implementation, these objects may or may not "physically" reside in the RDBMS. They are, however, always registered to, and maintained in conjunction with, the RDBMS.

Database Code Objects and e-Business

The drive to develop Internet-enabled applications has led to increased usage of database code objects. DBCOs can reduce development time, and everyone knows that Web-based projects are tasked out in Web time: there is a lot to do but little time in which to do it. DBCOs help because they promote code reusability. Instead of replicating code on multiple servers or within multiple application programs, DBCOs enable code to reside in a single place: the database server. DBCOs can be automatically executed based on context and activity, or can be called from multiple client programs as required. This is preferable to cannibalizing sections of program code for each new application that must be developed. DBCOs enable logic to be invoked from multiple processes instead of being re-coded into each new process every time the code is required.

An additional benefit of DBCOs is increased consistency. If every user and every database activity (with the same requirements) is assured of using the DBCO instead of multiple, replicated code segments, then you can assure that everyone is running the same, consistent code. If each


individual user used his or her own individual and separate code, no assurance could be given that the same business logic was being used by everyone. Actually, it is almost a certainty that inconsistencies will occur. Additionally, DBCOs are useful for reducing the overall code maintenance effort. Because DBCOs exist in a single place, changes can be made quickly without requiring propagation of the change to multiple workstations.

Another common reason to employ DBCOs is to enhance performance. A stored procedure, for example, may result in enhanced performance because it may be stored in parsed (or compiled) format, thereby eliminating parser overhead. Additionally, stored procedures reduce network traffic because multiple SQL statements can be invoked with a single execution of a procedure instead of sending multiple requests across the communication lines.

UDFs in particular are used quite often in conjunction with multimedia data. And many e-business applications require multimedia instead of static text pages. UDFs can be coded to manipulate multimedia objects that are stored in the database. For example, UDFs are available that can play audio files, search for patterns within image files, or manipulate video files.

Finally, DBCOs can be coded to support database integrity constraints, implement security requirements, and support remote data access. DBCOs are useful for creating specialized management functionality for the multimedia data types required of leading-edge e-business applications. Indeed, there are many benefits provided by DBCOs.


Database Code Object Programming Languages

Because they are application logic, most server code objects must be created using some form of programming language. Check constraints and assertions do not require procedural logic, as they can typically be coded with a single predicate. Although different RDBMS products provide different approaches for DBCO development, there are three basic tactics employed:

- Use a proprietary dialect of SQL extended to include procedural constructs
- Use a traditional programming language (either a 3GL or a 4GL)
- Use a code generator to create DBCOs

The most popular approach is to use a procedural SQL dialect. One of the biggest benefits derived from moving to an RDBMS is the ability to operate on sets of data with a single line of code. Using a single SQL statement, multiple rows can be retrieved, modified, or removed. But this very set-at-a-time orientation limits the viability of using standard SQL alone to create server code objects. That is why all of the major RDBMS products support procedural dialects of SQL that add looping, branching, and flow-of-control statements. The Sybase and Microsoft language is known as Transact-SQL, Oracle provides PL/SQL, and DB2 uses a language closer to the ANSI standard, simply called SQL Procedure Language. Procedural SQL has major implications for database design.

Procedural SQL will look familiar to anyone who has ever written any type of SQL or coded using any type of programming language. Typically, procedural SQL dialects contain constructs to support looping (while), exiting (return),


branching (goto), conditional processing (if...then...else), blocking (begin...end), and variable definition and usage. Of course, the procedural SQL dialects (Transact-SQL, PL/SQL, and SQL Procedure Language) are incompatible and cannot interoperate with one another.

The second approach is one supported by DB2 for OS/390: using a traditional programming language to develop stored procedures. Once coded, the program is registered to DB2 and can be referenced by SQL procedure calls.

A final approach is to use a tool to generate the logic for the server code object. Code generators can be used with any RDBMS that supports DBCOs, as long as the code generator supports the language required by the RDBMS product being used. Of course, code generators can be created for any programming language.

Which is the best approach? Of course, the answer is "It depends!" Each approach has its strengths and weaknesses. Traditional programming languages are more difficult to use but provide standards and efficiency. Procedural SQL is easier to use and more likely to be embraced by non-programmers, but is non-standard from product to product and can result in sub-optimal performance. It would be nice if the developer had an implementation choice, but the truth of the matter is that he must live with the approach implemented by the RDBMS vendor.
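As a small sketch of what these procedural constructs look like, here is a hypothetical UDF written in Oracle's PL/SQL dialect; the function name and the business rule it encodes are invented, and the Transact-SQL or SQL Procedure Language equivalents would differ in syntax but not in spirit:

    -- A scalar UDF using branching (IF...ELSIF...ELSE) and a return value
    CREATE OR REPLACE FUNCTION order_priority (p_amount IN NUMBER)
    RETURN VARCHAR2 AS
    BEGIN
        IF p_amount > 10000 THEN
            RETURN 'HIGH';
        ELSIF p_amount > 1000 THEN
            RETURN 'MEDIUM';
        ELSE
            RETURN 'LOW';
        END IF;
    END;

Once defined, such a function can be referenced directly from SQL, for example: SELECT order_id, order_priority(total_amt) FROM orders.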

The Duality of the DBA

Once DBCOs are coded and made available to the RDBMS, applications and developers will begin to rely on them.


Although the functionality provided by DBCOs is unquestionably useful and desirable, DBAs are presented with a major dilemma. Now that procedural logic is being stored in the DBMS, DBAs must grapple with the issues of quality, maintainability, and availability. How and when will these objects be tested? The impact of a failure is enterprise-wide, not relegated to a single application. This increases the visibility and criticality of these objects. Who is responsible if they fail? The answer must be: a DBA.

With the advent of DBCOs, the role of the DBA is expanding to encompass far too many duties for a single person to perform capably. The solution is to split the DBA's job into two separate parts based on the database object to be supported: data objects or database code objects. Administering and managing data objects is more in line with the traditional role of the DBA, and is well-defined. But DDL and database utility experts cannot be expected to debug procedures and functions written in C, COBOL, or even procedural SQL. Furthermore, even though many organizations rely on DBAs to be the SQL experts in the company, oftentimes these DBAs are not, at least not DML experts. Simply because the DBA knows the best way to create a physical database design and DDL does not mean he will know the best way to access that data.

The role of administering the procedural logic in the RDBMS should fall to someone skilled in that discipline. A new type of DBA must be defined to accommodate DBCOs and procedural logic administration. This new role can be defined as a procedural DBA.


The Role of the Procedural DBA

The procedural DBA should be responsible for those database management activities that require procedural logic support and/or coding. Of course, this should include primary responsibility for DBCOs. Whether DBCOs are actually programmed by the procedural DBA will differ from shop to shop. This will depend on the size of the shop, the number of DBAs available, and the scope of DBCO implementation. At a minimum, the procedural DBA should participate in and lead the review and administration of DBCOs. Additionally, he should be on call for DBCO failures.

Other procedural administrative functions that should be allocated to the procedural DBA include application code reviews, access path review and analysis (from EXPLAIN or show plan), SQL debugging, complex SQL analysis, and re-writing queries for optimal execution. Off-loading these tasks to the procedural DBA will enable the traditional, data-oriented DBAs to concentrate on the actual physical design and implementation of databases. This should result in much better designed databases.

The procedural DBA should still report through the same management unit as the traditional DBA and not through the application programming staff. This enables better skills sharing between the two distinct DBA types. Of course, there will need to be a greater synergy between the procedural DBA and the application programmer/analyst. In fact, the typical job path for the procedural DBA should come from the application programming ranks because this is where the coding skill-base exists.


Synopsis

As organizations begin to implement more procedural logic using the capabilities of the RDBMS, database administration will become increasingly complicated. The role of the DBA is rapidly expanding to the point where no single professional can be reasonably expected to be an expert in all facets of the job. It is high time that the job be explicitly defined into manageable components.


Chapter 5 - Building a Solid Information Architecture

How to Select the Optimal Information Exchange Architecture

Introduction

Over 80 percent of Information Technology (IT) projects fail. Startling? Maybe. Surprising? Not at all. In almost every IT project that fails, weakly documented requirements are typically the reason behind the failure. And nowhere is this more obvious than in data migration.

As Jim Collins points out in his book Good to Great, technology is at best an accelerator of a company's growth. The fact is, IT would not exist if not to improve a business and its ability to meet demand efficiently. Data is the natural by-product of IT systems, which provide structure around the data as it moves through various levels of operational processing. But is the value of data purely operational? If that were the case, there would be no need for migration. Companies can conduct forecasting exercises based on ordering trends of recent or parallel time periods, project fulfillment limits based on historic capacity measurements, or detect fraudulent activity by analyzing insurance claim trends for anomalies.



As more companies begin to understand the strategic value of data, the demands for accessing the data in new, innovative ways increase. This growth in information exchange requirements is precisely why a company must carefully deploy a solid information exchange architecture that can grow with the company’s ever-changing information sharing needs.

The Main Variables to Ponder

The main variables you have to consider are throughput of data across the network and processing power for transformation and cleansing. These are formidable challenges, fraught with potential danger like that bubble that forms on the inside wall of a tire as the tread wears through, soon to give way to a blowout. First, get some diagnostics of the current environment:

- Data Volume: Determine how much data needs to move from point to point (or server to server) in the information exchange.
- Available System Resources: Determine how much processing power is available at each point. Take these measurements at both peak and non-peak intervals.
- Transformation Requirements: Estimate the amount of transformation and cleansing to be conducted.
- Frequency: Determine the frequency at which this volume of data will be transmitted.

Data Volume


Understanding how much data must be moved from point to point will give you metrics against which you can compare your network bandwidth. If your network is nearly saturated already, adding the burden of information exchange may be more than it can handle.
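To put rough numbers on it: moving 50 GB of data each night across a link that sustains 100 Mbps would consume on the order of an hour even if the link were fully dedicated to the transfer (50 GB is roughly 400,000 megabits, and 400,000 divided by 100 is 4,000 seconds). If the batch window or the shared network cannot absorb that, then the volume, the frequency, or the architecture has to change. The figures here are purely illustrative; substitute your own measurements.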

Available System Resources

Determining how to maximize existing system resources is a significant savings measure. Can the information exchange be run during off-peak hours? Can the transformation be conducted on a server that is not fully utilized during peak hours? This is an exercise that should be conducted annually to ensure that you're getting the most out of your existing equipment, but it clearly provides immediate benefit when designing an information exchange solution.

Transformation Requirements

Before all else, be sure to conduct a Data Quality Assessment (DQA) to evaluate the health of your data resources. Probably the highest-profile element of an information architecture is its error management and resolution capability. A DQA will identify existing problems that lurk in the data and highlight enhancements that should be made to the systems that generate this data, to prevent such data quality concerns from recurring. Of course, there will be some issues that simply are not preventable, and others that have not yet materialized. In this case, it will be beneficial to implement monitors that periodically sample your data in search of non-compliant data.
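A data quality monitor does not have to be elaborate. As a minimal sketch, a scheduled query such as the following could flag rows that violate an expected rule; the table, columns, and the rule itself are invented for illustration:

    -- Hypothetical monitor: customers whose postal code is missing or blank
    SELECT customer_id, postal_code
      FROM customer
     WHERE postal_code IS NULL
        OR LENGTH(TRIM(postal_code)) = 0;

Run on a schedule and trended over time, the row counts such queries return become a simple measure of whether data quality is improving or degrading.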

Frequency

Determine how often data must be transmitted from point to point. Will data flow in one direction, or will it be bi-directional? Will it be flowing between two systems, or more than two? Can the information be exchanged weekly, nightly, or must it be as near to real-time as technically feasible?


Optimal Architecture Components

The optimal information exchange architecture will include as many of the following components as warranted by the project's objectives:

1. Data profiling
2. Data cleansing
3. System/network bandwidth resources
4. ETL (Extraction, Transformation & Loading)
5. Data monitoring

Naturally, there are commercial products available for each of these components, but you can just as easily build utilities to address your specific objectives.

Conclusion

While there is no single architecture that is ideal for all information exchange projects, the components laid out in this paper are the key criteria that successful information exchange projects address. Perhaps you can apply this five-component architecture to a new information exchange project, or evaluate existing information exchange architectures against it, and see if there is room for improvement. It is never too late to improve the foundation of such a critical business tool.

The more adept we become at sharing information electronically, the more rapidly our businesses can react to the daily changes that inevitably affect the bottom line. Rapid access to high-quality information on demand is the name of the game, and the first step is implementing a solid, stable information architecture.


Chapter 6 - Data 101

Getting Down to Data Basics

Well, this is the fourth eDBA column I have written for DBAzine, and I think it's time to start over at the beginning. Up to this point we have focused on the transition from DBA to eDBA, but some e-businesses are brand new to database management. These organizations are implementing eDBA before implementing DBA. And the sad fact of the matter is that many are not implementing any formalized type of DBA at all. Some daring young enterprises embark on Web-enabled database implementation with nothing more than a bevy of application developers. This approach is sure to fail.

If you take nothing else away from this article, make sure you understand this: every organization that manages data using a database management system (DBMS) requires a database administration group to ensure the effective use and deployment of the company's databases. In short, e-businesses that are brand new to database development need a primer on database design and administration. So, with that in mind, it's time to get back to data basics.

Data Modeling and Database Design

Novice database developers frequently begin with the quick-and-dirty approach to database implementation. They approach


database design from a programming perspective. That is, novices do not have experience with databases and data requirements gathering, so they attempt to design databases like the flat files they are accustomed to using. This is a major mistake, as anyone using this approach quickly finds out once the databases and applications move to production. At a minimum, performance will suffer and data may not be as readily available as required. At worst, data integrity problems may arise, rendering the entire application unusable.

A relational database design cannot be thrown together quickly by novices. What is required is a practiced and formal approach to gathering data requirements and modeling that data. This modeling effort requires that entity and data element names follow an established, standard naming convention. Failure to apply standard names will result in databases that are difficult to use because no one knows their actual contents. Data modeling also requires the collection of data types and lengths, domains (valid values), relationships, anticipated cardinality (number of instances), and constraints (mandatory, optional, unique, etc.).

Once the data is collected and its business usage is known, a process called normalization is applied to the data model. Normalization is an iterative process that I will not cover in detail here. Suffice it to say, a normalized data model reduces data redundancy and inconsistencies by ensuring that the data elements are designed appropriately. A series of normalization rules are applied to the entities and data elements, each of which is called a "normal form." If the data


conforms to the first rule, the data model is said to be in "first normal form," and so on. A database design in First Normal Form (1NF) will have no repeating groups, and each instance of an entity can be identified by a primary key. For Second Normal Form (2NF), every data element in an entity must depend on the entire primary key for that entity, not just part of it. Third Normal Form (3NF) removes data elements that depend on other non-key elements rather than on the primary key: if the contents of a group of data elements can apply to more than a single entity instance, those data elements belong in a separate entity. There are further levels of normalization that I will not discuss in this column to keep the discussion moving along. For an introductory discussion of normalization visit http://wdvl.com/Authoring/DB/Normalization.
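As a small illustration of what normalization does in practice, compare a flat, un-normalized order record with a normalized pair of tables; the table and column names here are hypothetical:

    -- Un-normalized: repeating item columns violate first normal form
    CREATE TABLE order_flat (
        order_id  INTEGER     NOT NULL PRIMARY KEY,
        cust_name VARCHAR(60),
        item1_id  INTEGER,
        item1_qty INTEGER,
        item2_id  INTEGER,
        item2_qty INTEGER
    );

    -- Normalized: one row per order, one row per order line
    CREATE TABLE cust_order (
        order_id    INTEGER NOT NULL PRIMARY KEY,
        customer_id INTEGER NOT NULL
    );

    CREATE TABLE order_line (
        order_id INTEGER NOT NULL,
        line_no  INTEGER NOT NULL,
        item_id  INTEGER NOT NULL,
        qty      INTEGER NOT NULL,
        PRIMARY KEY (order_id, line_no)
    );

The flat design caps every order at two lines and repeats customer information on every order; the normalized design stores each fact once.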

Physical Database Design

But you cannot stop after developing a logical data model in 3NF. The logical model must be adapted to a physical database implementation. Contrary to popular belief, this is not a simple transformation of entities to tables. Many other physical design factors must be planned and implemented. These factors include:

- A relational table is not the same as a file or a data set. The DBA must design and create the physical storage structures to be used by the relational databases to be implemented.
- The order of columns may need to be different than that specified by the data model, based on the functionality of the RDBMS being used. Column order and access may have an impact on database logging, locking, and organization. The DBA must understand these issues and transform the logical model appropriately.
- The logical data model needs to be analyzed to determine which relationships need to be physically implemented using referential integrity (RI). Not all relationships should be defined using RI, due to processing and performance reasons.
- Indexes must be designed to ensure optimal performance. To create the proper indexes, the DBA must examine the database design in conjunction with the proposed SQL to ensure that database queries are supported with the proper indexes (see the sketch after this list).
- Database security and authorization must be defined for the new database objects and their users.

These are not simple tasks that can be performed by individuals without database design and implementation skills. That is to say, DBAs are required.
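As a brief sketch of that index-design step, assume the application will frequently run a query like the one below against a hypothetical orders table; a composite index on the two predicate columns is one reasonable way to support it:

    -- Proposed application query
    SELECT order_id, order_date, total_amt
      FROM orders
     WHERE customer_id = 1001
       AND order_date >= DATE '2003-01-01';

    -- Candidate index chosen to support that predicate
    CREATE INDEX ix_order_cust_date
        ON orders (customer_id, order_date);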

The DBA Management Discipline

Database administration must be approached as a management discipline. The term discipline implies planning, and then implementation according to that plan. When database administration is treated as a management discipline, the treatment of data within your organization will improve. It is the difference between being reactive and proactive.

All too frequently the DBA group is overwhelmed by requests and problems. This occurs for many reasons, including understaffing, overcommitment to application development projects, lack of repeatable processes, lack of budget, and so on.


When operating in this manner, the database administrator is being reactive. The reactive DBA functions more like a firefighter: his attention is focused on resolving whichever problem is burning hottest at the moment. A proactive DBA can avoid many problems altogether by developing and implementing a strategic blueprint to follow when deploying databases within the organization.

The 17 Skills Required of a DBA

Implementing a DBA function in your organization requires careful thought and planning. The previous sections of this article are just a beginning. The successful eDBA will need to acquire and hone expertise in the following areas:

Data modeling and database design. The DBA must possess the ability to create an efficient physical database design from a logical data model and application specifications. The physical database may not conform to the logical model 100 percent due to physical DBMS features, implementation factors, or performance requirements. If a data resource management discipline has not been established, the DBA also must be responsible for data modeling, normalization, and conceptual and logical design.

Metadata management and repository usage. The DBA is required to understand the technical data requirements of the organization. But this is not a complete description of his duties. Metadata, or data about the data, also must be maintained. The DBA, or sometimes the Data Administrator (DA), must collect, store, manage, and enable the ability to query the organization's metadata. Without metadata, the data stored in databases lacks true meaning.


Database schema creation and management. A DBA must be able to translate a data model or logical database design into an actual physical database implementation and to manage that database once it has been implemented.

Procedural skills. Modern databases manage more than merely data. The DBA must possess procedural skills to help design, debug, implement, and maintain stored procedures, triggers, and user-defined functions that are stored in the DBMS. For more on this topic check out www.craigsmullins.com/db2procd.htm.

Capacity planning. Because data consumption and usage continue to grow, the DBA must be prepared to support more data, more users, and more connections. The ability to predict growth based on application and data usage patterns and to implement the necessary database changes to accommodate the growth is a core capability of the DBA.

Performance management and tuning. Dealing with performance problems is usually the biggest post-implementation nightmare faced by DBAs. As such, the DBA must be able to proactively monitor the database environment and to make changes to data structures, SQL, application logic or the DBMS subsystem to optimize performance.

Ensuring availability. Applications and data are increasingly required to be up and available 24 hours a day, seven days a week. The DBA must be able to ensure data availability using non-disruptive administration tactics.

SQL code reviews and walk-throughs. Although application programmers usually write SQL, DBAs are usually blamed for poor performance. Therefore, DBAs must possess in-depth SQL knowledge so they can understand and review SQL and host language programs and recommend changes for optimization.


Backup and recovery. Everyone owns insurance of some type because we want to be prepared in case something bad happens. Implementing robust backup and recovery procedures is the insurance policy of the DBA. The DBA must implement an appropriate database backup and recovery strategy based on data volatility and application availability requirements.

Ensuring data integrity. DBAs must be able to design databases so that only accurate and appropriate data is entered and maintained. To do so, the DBA can deploy multiple types of database integrity including entity integrity, referential integrity, check constraints, and database triggers. Furthermore, the DBA must ensure the structural integrity of the database.

General database management. The DBA is the central source of database knowledge in the organization. As such he must understand the basic tenets of relational database technology and be able to accurately communicate them to others.

Data security. The DBA is charged with the responsibility to ensure that only authorized users have access to data. This requires the implementation of a rigorous security infrastructure for production and test databases.

General systems management and networking skills. Once databases are implemented, they are accessed throughout the organization and interact with other technologies, so the DBA must be a jack of all trades. This requires the ability to integrate database administration requirements and tasks with general systems management requirements and tasks (like job scheduling, network management, transaction processing, and so on).


ERP and business knowledge. For e-businesses doing Enterprise Resource Planning (ERP), the DBA must understand the requirements of the application users and be able to administer their databases to avoid interruption of business. This sounds easy, but most ERP applications (SAP, PeopleSoft, etc.) use databases differently than homegrown applications. So DBAs require an understanding of how the ERP packaged applications impact the e-business and how the databases used by those packages differ from traditional relational databases. Some typical differences include application-enforced RI, program locking, and the creation of database objects (tables, indexes, etc.) on the fly. These differences require different DBA techniques to manage the ERP package effectively.

Extensible data type administration. The functionality of modern DBMSes can be extended using user-defined data types. The DBA must understand how these extended data types are implemented by the DBMS vendor and be able to implement and administer any extended data types implemented in their databases.

Web-specific technology expertise. For e-businesses, DBAs are required to have knowledge of Internet and Web technologies to enable databases to participate in Web-based applications. Examples of this type of technology include HTTP, FTP, XML, CGI, Java, TCP/IP, Web servers, firewalls and SSL. Other DBMS-specific technologies include IBM's Net.Data for DB2 and Oracle Portal (formerly WebDB).

Storage management techniques. The data stored in every database resides on disk somewhere (unless it is stored on one of the new Main Memory DBMS products). The DBA must understand the storage hardware and software available for use, and how it interacts with the DBMS being used. Storage technologies include raw devices, RAID, SANs, and NAS.

Meeting the Demand

The number of mission-critical Web-based applications that rely on back-end databases is increasing. Established and emerging e-businesses achieve enormous benefits from the Web/database combination, such as rapid application development, cross-platform deployment, and robust, scalable access to data. E-business usage of database technology will continue to grow, and so will the demand for the eDBA. Make sure your organization is prepared to manage its Web-enabled databases before moving them to production. Or be prepared to encounter plenty of problems.


Chapter 7 - Designing Efficient Databases

Design and the eDBA

Welcome to another installment in the ongoing saga of the eDBA. So far in this series of articles, we have discussed eDBA issues ranging from availability and database recovery to new technologies such as Java and XML, and even sources of on-line DBA information. But for this installment we venture back to the very beginnings of a relational database: the design stage. In this article we will investigate the impact of e-business on the design process and discuss the basics of assuring proper database design.

Living at Web Speed

One of the biggest problems that an eDBA will encounter when moving from traditional development to e-business development is coping with the mad rush to "get it done NOW!" Industry pundits have coined the phrase "Internet time" to describe this phenomenon. Basically, when a business starts operating on "Internet time," things move faster. One "Web month" is said to be equivalent to about three standard months. The nugget of truth in this load of malarkey is that Web projects move very fast, for a number of reasons:

- Because business executives want to conduct more and more business over the Web to save costs and to connect better with their clients.
- Because someone read an article in an airline magazine saying that Web projects should move fast.
- Because everyone else is moving fast, so you'd better move fast, too, or risk losing business.

Well, two of these three reasons are quite valid. I'm sure you may have heard other reasons for rapid application development (RAD). And sometimes RAD is required for certain projects. But RAD is bad for database design. Why? Applications are temporary, but the data is permanent. Organizations are forever coding and re-coding their applications; sometimes the next incarnation of an application is being developed before the last one has even been moved to production. But when did you ever throw away data?

Oh, sure, you may redesign a database or move from one DBMS to another. But what did you do? Chances are, you saved the data and migrated it from the old database to the new one. Some changes had to be made, maybe some external data was purchased to combine with the existing data, and most likely some parts of the database were not completely populated. But data lives forever.

To better enable you to glean value from your data, it is wise to take care when designing the database. A well-designed database is easy to navigate, and therefore it is easier to retrieve meaningful data from it.


Database Design Steps

The DBA should create databases by transforming logical data models into physical implementations. It is not wise to dive directly into a physical design without first conducting an in-depth examination of the data needs of the business. Data modeling is the process of analyzing the things of interest to your organization and how these things are related to each other. The data modeling process results in the discovery and documentation of the data resources of your business. Data modeling asks the question "What?" instead of the more common data processing question, "How?" Before implementing databases of any sort, a sound model of the data to be stored in the database should be developed.

Novice database developers frequently begin with the quick-and-dirty approach to database implementation. They approach database design from a programming perspective. That is, novices do not have experience with databases and data requirements gathering, so they attempt to design databases like the flat files they are accustomed to using. This is a mistake because problems inevitably occur after the databases and applications become operational in a production environment. At a minimum, performance will suffer and data may not be as readily available as required. At worst, data integrity problems may arise, rendering the entire application unusable.

A proper database design cannot be thrown together quickly by novices. What is required is a practiced and formal approach to gathering data requirements and modeling data. This modeling effort requires a formal approach to discovering and identifying


entities and data elements. Data normalization is a big part of data modeling and database design. A normalized data model reduces data redundancy and inconsistencies by ensuring that the data elements are designed appropriately. It is actually quite simple to learn the basics of data modeling, but it can take a lifetime to master all of its nuances.

Once the logical data model has been created, the DBA uses his knowledge of the DBMS that will be used to transform logical entities and data elements into physical database tables and columns. To successfully create a physical database design, you will need to have a good working knowledge of the features of the DBMS, including:

- In-depth knowledge of the database objects supported by the DBMS and the physical structures and files required to support those objects
- Details regarding the manner in which the DBMS supports indexing, referential integrity, constraints, data types, and other features that augment the functionality of database objects
- Detailed knowledge of new and obsolete features for particular versions or releases of the DBMS to be used
- Knowledge of the DBMS configuration parameters that are in place
- Data definition language (DDL) skills to translate the physical design into actual database objects

Armed with the correct information, the DBA can create an effective and efficient database from a logical data model. The first step in transforming a logical data model into a physical


model is to perform a simple translation from logical terms to physical objects. Of course, this simple transformation will not result in a complete and correct physical database design; it is simply the first step. The transformation consists of the following:

- Identify and create the physical data structures to be used by the database objects (for example, table spaces, segments, partitions, and files)
- Transform logical entities in the data model to physical tables
- Transform logical attributes in the data model to physical columns
- Transform domains in the data model to physical data types and constraints
- Choose a primary key for each table from the list of logical candidate keys
- Examine column ordering to take advantage of the processing characteristics of the DBMS
- Build referential constraints for relationships in the data model
- Reexamine the physical design for performance

Of course, the above discussion is a very quick introduction to and summary of data modeling and database design. Every DBA should understand these topics and make sure that all projects, even e-business projects operating on "Internet time," follow the tried and true steps to database design.
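To make the translation step a little more concrete, here is a minimal sketch of what a couple of logical constructs might become in DDL; the entities, attributes, and domains are invented for illustration:

    -- Entity CUSTOMER becomes a table; attributes become columns;
    -- the "status" domain becomes a data type plus a check constraint;
    -- one candidate key is chosen as the primary key.
    CREATE TABLE customer (
        customer_id INTEGER     NOT NULL,
        cust_name   VARCHAR(60) NOT NULL,
        cust_status CHAR(1)     NOT NULL CHECK (cust_status IN ('A', 'I')),
        PRIMARY KEY (customer_id)
    );

    -- The CUSTOMER-places-ORDER relationship becomes a referential constraint.
    CREATE TABLE cust_order (
        order_id    INTEGER NOT NULL,
        customer_id INTEGER NOT NULL,
        order_date  DATE    NOT NULL,
        PRIMARY KEY (order_id),
        FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
    );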


Database Design Traps

Okay, so what if you do not practice data modeling and database design? Or what if you'd like to, but are forced to operate on "Internet time" for certain databases? Well, the answer, of course, is "it depends!" The best advice I can give you is to be aware of design failures that can result in a hostile database. A hostile database is difficult to understand, hard to query, and takes an enormous amount of effort to change. Of course, it is impossible to list every type of database design flaw that could be introduced to create a hostile database. But let's examine some common database design failures.

Assigning inappropriate table and column names is a common design error made by novices. Database objects that are used to store data should be given names as descriptive as possible, so that the tables and columns document themselves, at least to some extent. Application programmers are notorious for creating database naming problems, such as using screen variable names for columns or coded jumbles of letters and numbers for table names.

When pressed for time, some DBAs resort to designing the database with output in mind. This can lead to flaws such as storing numbers in character columns because leading zeroes need to be displayed on reports. This is usually a bad idea with a relational database. It is better to let the database system perform the edit-checking to ensure that only numbers are stored in the column.


If the column is created as a character column, then the developer will need to program edit-checks to validate that only numeric data is stored in the column. It is better in terms of integrity and efficiency to store the data based on its domain. Users and programmers can format the data for display instead of forcing the data into display format for storage in the database.

Another common database design problem is overstuffing columns. This actually is a normalization issue. Sometimes a single column is used for convenience to store what should be two or three columns. Such design flaws are introduced when the DBA does not analyze the data for patterns and relationships. An example of overstuffing would be storing a person's name in a single column instead of capturing first name, middle initial, and last name as individual columns.

Poorly designed keys can wreck the usability of a database. A primary key should be nonvolatile, because changing the value of a primary key can be very expensive: when you change a primary key value, you have to ripple through the foreign keys to cascade the change into the child tables. A common design flaw is using Social Security number as the primary key of a personnel or customer table. This is a flaw for several reasons, two of which are: 1) a Social Security number is not necessarily unique, and 2) if your business expands outside the USA, no one will have a Social Security number to use, so then what do you store as the primary key?

Actually, failing to account for international issues can have greater repercussions. For example, when storing addresses, how do you define zip code? Zip code is a USA concept, but many


countries have similar codes, though they are not necessarily numeric. And state is a USA concept, too. Of course, some other countries have states or similar concepts (Canadian provinces, for example). So just how do you create all of the address columns to assure that you capture all of the information for every person to be stored in the table, regardless of country? The answer, of course, is to conduct proper data modeling and database design; a small sketch of a design that sidesteps these traps follows at the end of this section.

Denormalization of the physical database is a design option, but it can only be done if the design was first normalized. How do you denormalize something that was not first normalized? Actually, a more fundamental problem with database design is improper normalization. By focusing on normalization, data modeling, and database design, you can avoid creating a hostile database.

Without proper upfront analysis and design, the database is unlikely to be flexible enough to easily support the changing requirements of the user. With sufficient preparation, flexibility can be designed into the database to support the user's anticipated changes. Of course, if you don't take the time during the design phase to ask the users about their anticipated future needs, you cannot create the database with those needs in mind.
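As promised above, here is a revised version of the hypothetical customer table sketched earlier in this chapter, reworked to sidestep the traps just described: a surrogate key instead of a Social Security number, name parts captured separately, and address columns that do not assume USA conventions. All names and sizes are invented for illustration:

    CREATE TABLE customer (
        customer_id INTEGER     NOT NULL,    -- surrogate key; stable and meaningless
        first_name  VARCHAR(40) NOT NULL,
        middle_init CHAR(1),
        last_name   VARCHAR(40) NOT NULL,
        natl_id_no  VARCHAR(20),             -- SSN or other national identifier, stored but not the key
        postal_code VARCHAR(10),             -- character data: postal codes are not always numeric
        region      VARCHAR(40),             -- state, province, or equivalent
        country_cd  CHAR(2),
        PRIMARY KEY (customer_id)
    );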

Taming the Hostile Database

If data is the heart of today's modern e-business, then database design is the armor that protects that heart. Data modeling and database design are the most important parts of a database application.


If proper design is not a component of the database creation process, you will wind up with a confusing mess of a database that may work fine for the first application, but not for subsequent applications. And heaven help the developer or DBA who has to make changes to the database or application because of changing business requirements. That DBA will have to try to tame the hostile database.


Chapter 8 - The eBusiness Infrastructure

E-Business and Infrastructure

Pick up any computer trade journal today and you can't help but read about e-business. It's everywhere. So you read the magazines, attend a course on designing websites, hire a Web master. You buy a server or subscribe to someone else's services, you install some software, and you are ready to have the world of e-business open up to you.

Soon your website is up and running. Through clever marketing and well-placed advertisements, you have the world beating a path to your door. As if by magic, the Internet is opening doors for you in places that you have only dreamed about. It all is happening just like the industry pundits said it would. You are ready to go public and retire to the beach.

But along the way some things you never read about start to happen. The Web application that was originally designed needs to be changed because the visitors to the website are responding in a manner totally unanticipated. They are looking at things you never intended them to look at. And they are ignoring things that they should be paying attention to. The changes need to be made immediately. Then the volumes of data that are being generated and gathered swamp the system. Entire files are being lost because the system just isn't able to cope.


Next, Web performance turns sour, right in the middle of the busiest season, when most of the business is being generated. Customers are turned off and sales are being lost. The reality of the Web environment has hit. Creating the website is one thing. Making it operate successfully on a day-to-day basis is something else.

After the initial successes in e-business, management eyes the possibilities and demands that the "new business" be integrated into the "old business." In fact, Wall Street looks over your shoulder and dictates that such integration take place. Never before has the IT executive been on the hot seat with so much visibility.

Then the volumes of data grow so large the Web can't swallow them. Performance gets worse and files are lost. One day the head of systems decides a new and larger (and much more expensive) computer is needed for the server. But the cost and complexity of upgrading the computer is only the beginning headache. All of today's Web applications and data have to be converted into tomorrow's Web systems. And the conversion must be done with no outages that are apparent to the Web visitors.

Immediately after the new computer is brought in, the IT staff announces that a new DBMS is needed. And with the new DBMS comes another Web conversion. Then management introduces a new product line that has to get to the market right away. And that means the applications have to be rewritten and changed. Again. More conversion occurs.


Soon the volumes of data generated by the massive amounts of Web traffic deluge the system, again. And then the network and all of its connections and nodes need to be upgraded. So there needs to be another conversion. And after the network is expanded, there is a need for more memory. Soon the system is once again overwhelmed by the volume of data.

But wait! This is what success in e-business is all about. If the business aspects of e-business were not successful, then there would not be all this system churn. It is the success of the business that causes system performance, volumes of data, and system integration to become large issues. But is this system churn in the Web environment necessary? Does there have to be all this pain, this constant pain, associated with managing the systems aspect of e-business?

The answer is that e-business does not have to be painful at all, even in the face of overwhelming business success. Indeed, it is entirely possible to get ahead of the curve and stay ahead of the curve when it comes to building and managing the systems side of e-business. There is nothing about the systems found in e-business that mandates that e-business must operate in a reactive mode. Instead, the systems side of e-business is best managed in a proactive mode.

The problem is that when most organizations approach building e-business systems they forget everything they ever knew about doing basic data processing. Indeed, there is an air about e-business advocates that suggests that e-business technology is new and different and not subject to the forces that have shaped an earlier world of technology.


While there certainly are new opportunities with e-business, and while e-business certainly does entail some new technology, the technology behind e-business is as subject to the standard forces of technology as every other technology that has preceded it. The secret to becoming proactive in the building and management of e-business systems is understanding, planning, and building the infrastructure that supports e-business. E-business is not just a website; e-business is a website and an infrastructure. When the infrastructure surrounding e-business is not built properly (or not built at all), many problems arise. The following figure suggests that as success occurs on the Web, the website infrastructure becomes increasingly unstable.


But when the infrastructure for e-business is built properly, the result is the ability to support long-term growth: of data, of transactions, of new applications, of change to existing applications, of integration with existing corporate systems, and so forth.

What, in fact, does the long-term infrastructure for the systems that run the Web-based e-business environment look like? The following figure describes what the infrastructure for the systems that run the e-business environment needs to look like.

The figure above shows the following components. The Internet connects the corporation to the world. Transactions coming in from the Internet pass through a firewall; once past the firewall, the transactions enter the corporate website.


Inside the website, the transactions are managed by software that creates a series of HTML pages that are passed back to the Internet user as part of a session or dialogue. But there are other system components needed for the support of the website. One capability the website has is the ability to create and send transactions to the standard corporate systems environment. When too much data starts to collect in the Web environment, it passes out of the Web environment into a granularity manager, which in turn passes the now refined and condensed data into a data warehouse. And the website has the ability to access data directly from the corporate environment by means of an ODS. This supporting infrastructure then allows the Web-based e-business environment to thrive and grow. With the infrastructure that has been suggested, the Web environment can operate in a proactive mode. (For more information about data warehousing, ODSs, and other components, please refer to the corporate information factory as described in the book The Corporate Information Factory, John Wiley, or to the website www.BILLINMON.COM.)

One of the major problems with the Web environment is that it is almost never presented as needing an infrastructure behind it. The marketing pitch is that the Web is easy and consists of nothing more than a website. For tinker toy Web environments this is true. But for industrial strength websites this is not true at all. The Web environment and the supporting infrastructure must be designed and planned carefully, and from the outset.


The second major problem with the Web infrastructure is the attitude of many Web development agencies. The attitude is that since the Web is new technology, there is no need to pay attention to older technologies or lessons learned from older technologies. Instead the newness of the Web technology allows the developer to escape from an older environment. This is true to a very limited extent. But once past the immediate confines of the capabilities of new Web technology, the old issues of performance, volumes of data and so forth once again arise, as they have with every successful technology.


Chapter 9 - Conforming to Your Corporate Structure

Integrating Data in the Web-Based E-Business Environment

In order to be called industrial strength, the Web-based e-business environment needs to be supported by an infrastructure called the corporate information factory. The corporate information factory is able to manage large volumes of data, provide good response time in the face of many transactions, allow data to be examined at both a detailed level and a summarized level, and so forth. Figure 1 shows the relationship of the Web-based e-business environment and the corporate information factory.


Figure 1: How the Web environment and the corporate information factory interface



Good performance and the management of large amounts of data are simply expected in the Web environment. But there are other criteria for success in the e-business environment. One of those criteria is the integration of Web-based data and data found in the corporate environment. Figure 2 shows that the data in the Web environment needs to be compatible with the data in the corporate systems environment.

Figure 2: There needs to be an intersection of web data with corporate information factory data


In addition, there needs to be integration of data across different parts of the Web environment. If the Web environment grows large, it is necessary that there not be different definitions and conventions in different parts of the Web environment. There simply is a major disconnect when the Web environment uses one set of definitions and structures that are substantively different from the definitions and structures found in corporate systems. When the same part is called "ABC11-09" in the Web environment and "187-POy7" in the corporate environment, there is opportunity lost.

For many, many reasons it is necessary to ensure that the data found in the Web environment is able to be integrated with the data in the corporate systems environment. Some of the reasons for the importance of integration of Web data and corporate data are:

- Business can be conducted across the Web environment and the corporate systems environment at the same time
- Customers will not be frustrated by dealing with different parts of the company
- Reports can be written that encompass both the Web and corporate systems
- The Web environment can take advantage of processes that are already in place
- Massive and complex conversion programs do not have to be written, and so forth

While there are many reasons for the importance of integration, the most important reason is the ability to use work that has already been done. When the Web environment data is



consistent with corporate data, the Web designer is able to use existing systems in situ. But where the data in the Web environment is not compatible with corporate data, the Web designer has the daunting task of writing all systems from scratch. The unnecessary work that is entailed is nothing short of enormous.

Specifically, what is meant by integration of data across the two environments? At the very least, there must be consistency in the definitions of data, the key structures of data, the encoding structures, reference tables, and descriptions. The data that resides in one system must be clearly recognizable in another system, and vice versa. The user must see the data as the same, the designer must see the data as the same, and the programmer must see the data as the same. When these parties do not see the data as the same (when in fact the data represents the same thing), then there is a problem. Of course, the converse is true as well: if there is a representation of data in one place, it must be consistent with all representations found elsewhere.

How exactly is uniformity across the Web-based e-business environment and the corporate systems environment achieved? There are two answers to this question. Integration can be achieved as the Web environment is being built, or after the Web environment is already built. By far, the preferable choice is to achieve integration as the Web environment is being built. To achieve cohesion and consistency at the point of initial construction of the Web, integration starts at the conceptual level. Figure 3 shows that as the Web-based systems are being


built, the Web designer builds the Web systems with knowledge of corporate data.

Figure 3: The content, structure, and keys of the corporate systems need to be used in the creation of the Web environment.

Figure 3 shows that the Web designer must be fully aware of the corporate data model, corporate reference tables, corporate data structures and corporate definitions. To build the Web environment in ignorance of these simple corporate conventions is a waste of effort. So the first opportunity to achieve integration is to build the Web environment in conformance with the corporate systems environment at the outset. But in an imperfect world, there are



bound to be some differences between the environments. In some cases, the differences are large. In others, the differences are small. The second opportunity to achieve integration across the Web environment and the corporate systems environment is at the point where data is moved from the website to and through the granularity manager and then on into the data warehouse. This is the point where integration is achieved across multiple applications by the use of ETL in the classical data warehouse environment. Figure 4 shows that the granularity manager is used to convert and integrate data as it moves from the Web-based environment to the corporate systems environment. There are of course other tasks that the granularity manager performs.


Figure 4: Of particular interest is the granularity manager which manages the flow of data from the Web environment to the corporate information factory.

Where the Web-based systems have been built in conformance with the corporate systems, the work the granularity manager does is straightforward and simple. But where the Web environment has been built independently of the corporate systems environment, the granularity manager must do a great deal of complex work as it reshapes the Web data into the form and structure needed by the corporate environment.

Does integration of data mean that processing is integrated as well? The answer is that data needs to be consistent across the two environments, but processing may or may not be. Undoubtedly there will be some processing that is unique to the Web environment, and that processing will remain unique. But where processing is not unique to the Web environment, it is very advantageous that the processing in the two environments not be separate and apart. Unfortunately, achieving common processing between the two environments is not easy when the corporate environment was built long ago in technology designed only for the corporate environment. Far and away the most preferable approach is to conform the Web environment to the corporate systems environment from the outset. A little foresight at the beginning saves a huge amount of work and confusion later.
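When the Web environment has not been conformed from the outset, the reshaping described above falls to the granularity manager. To make that idea concrete, the following sketch (written in Java purely for illustration; the class name, reference tables, and field names are hypothetical and not part of any product discussed here) shows the kind of key and encoding translation that might be applied to a Web record built with its own conventions.

import java.util.HashMap;
import java.util.Map;

// Sketch only: translating Web-side keys and encodings into the corporate
// key structure. All names and mappings here are hypothetical illustrations.
public class WebToCorporateTranslator {

    // Reference table: Web-side product codes -> corporate part numbers.
    private final Map<String, String> partNumbers = new HashMap<>();
    // Reference table: Web-side status codes -> corporate encoding.
    private final Map<String, String> statusCodes = new HashMap<>();

    public WebToCorporateTranslator() {
        partNumbers.put("WEB-1001", "aq4450-p");
        partNumbers.put("WEB-1002", "su887-p1");
        statusCodes.put("S", "SHIPPED");
        statusCodes.put("B", "BACKORDERED");
    }

    // Convert one Web-side record into the corporate key structure and encodings.
    public Map<String, String> toCorporate(Map<String, String> webRecord) {
        Map<String, String> corporate = new HashMap<>();
        corporate.put("partNumber",
                partNumbers.getOrDefault(webRecord.get("productCode"), "UNKNOWN"));
        corporate.put("orderStatus",
                statusCodes.getOrDefault(webRecord.get("status"), "UNKNOWN"));
        corporate.put("customerKey", webRecord.get("cookieId")); // shared key structure
        return corporate;
    }
}

The point of the sketch is simply that every such mapping is work the designer avoids entirely when the Web environment uses the corporate definitions, keys, and encodings from the start.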


Building Your Data Warehouse

CHAPTER

10 The Issues of the E-Business Infrastructure

In order to have long-term success, the systems of the e-business environment need to exist in an infrastructure that supports the full set of needs. The e-business Web environment needs to operate in conjunction with the corporate information factory. The corporate information factory, then, is the supporting infrastructure for the Web-based e-business environment, the one that allows the different operational needs to be met. Figure 1 shows the positioning of the Web environment relative to the infrastructure known as the corporate information factory.


Figure 1: How the Web environment is positioned within the corporate information factory

Figure 1 shows the Web has a direct interface to the transaction processing environment. The Web can create and send a transaction based on the interaction and dialogue with the Web user. The Web can access corporate data through requests for data from the corporate ODS. And data passes out of the Web into a component known as the granularity manager and then into the data warehouse. In such a manner the Web-based e-business environment is able to operate in conjunction with the corporate information environment.

What then are the issues that face the Web designer/administrator in the successful operation of the Web e-business environment? There are three pressing issues at the forefront of success:

- managing the volumes of data that are collected as a by-product of e-business processing
- establishing and achieving good website performance, so that the Internet interaction is adequate for the user
- integrating e-business processing with other, already established corporate processing

These three issues are at the heart of the successful operation of the e-business environment. They are not addressed directly inside the Web environment, but by the Web environment interfacing with the corporate information factory.

Large Volumes of Data

The biggest and most pervasive challenge facing the Web designer is managing the large volumes of data that collect in the Web environment. These large volumes of data are created as a by-product of interacting and dialoguing with the many viewers of the website. The data is created in many ways: by direct feedback from end users, by transactions created as a result of dialogue, and by capturing the click stream created by the dialogues and sessions passing through the Web.


The largest issue of volumes of data by far is that of the click stream data. There is simply a huge volume of data collected as a result of the end user rummaging through the website. One of the issues of click stream data is that much of the data is collected at too low a level of detail. In order to be useful, the click stream data must be edited and compacted.

One way large volumes of data are accommodated is through the organization of data into a hierarchical structure of storage. The corporate information factory infrastructure allows an almost infinite volume of data to be collected and managed. The very large amount of data that is managed by the Web and the corporate information factory is structured into a hierarchy of storage. Figure 2 illustrates the hierarchy of storage that is created between the Web environment and the corporate information factory.


Figure 2: There is a hierarchy of storage as data flows from the Web environment to the data warehouse environment to the bulk storage data environment.

Figure 2 shows that data flows from the Web environment to the data warehouse and then from the data warehouse to alternative or near line storage.

Another way that large volumes of data are handled is by the condensation of data as it passes out of the Web environment and into the corporate information factory. As data passes from the website to the data warehouse, it passes through a granularity manager. The granularity manager performs the function of editing and condensing the Web-generated data. Data that is not needed is deleted. Data that needs to be combined is aggregated. Data that is too granular is summarized. The granularity manager has many ways of reducing the sheer volume of data created in the Web environment.

Typically data passes from the Web to the data warehouse every several hours, or at least once a day. Once the data is passed to the data warehouse, the space is reclaimed in the Web environment and made available for the next iteration of Web processing. By clearing out large volumes of data in the Web on a very frequent basis, the Web does not become swamped with data, even during times of heavy access and usage.

But data does not remain permanently in the data warehouse. Data passes through the data warehouse on a cycle of every six months to a year. From the data warehouse, data is passed to alternative or near line storage. In near line storage, data is collected and stored on what can be termed a "semi-archival" basis. Once in near line storage, data remains permanently available. The cost of near line storage is so low that data can effectively remain there as long as desired. And the capacity of near line storage is such that an essentially unlimited volume of data can be stored.

There is then a hierarchy of storage that is created, from the Web to the data warehouse to alternative/near line storage. Some of the characteristics of each level of the hierarchy are:

- Web - very high probability of access, very current data (24 hours), very limited volume
- Data Warehouse - moderate probability of access, historical data (from 24 hours to six months old), large volumes
- Alternative/Near Line Storage - low probability of access, deep historical data (ten years or more), very, very large volumes of data

The hierarchy of data storage that has been described is capable of handling even the largest volumes of data.
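Returning to the condensation role of the granularity manager described above, the sketch below (Java, with hypothetical class and field names; it is not the actual granularity manager of any product) shows the general idea: many raw click stream events are collapsed into one summary record per session before the data flows on to the data warehouse.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: condensing raw click stream events into one
// summary record per session, the kind of volume reduction the granularity
// manager performs on Web-generated data.
public class ClickstreamCondenser {

    public static class ClickEvent {
        String sessionId;
        String page;
        long timestampMillis;
    }

    public static class SessionSummary {
        String sessionId;
        int pageViews;
        long firstClickMillis;
        long lastClickMillis;
    }

    // Collapse many detailed events into one summary row per session.
    public List<SessionSummary> condense(List<ClickEvent> events) {
        Map<String, SessionSummary> bySession = new HashMap<>();
        for (ClickEvent e : events) {
            SessionSummary s = bySession.computeIfAbsent(e.sessionId, id -> {
                SessionSummary ns = new SessionSummary();
                ns.sessionId = id;
                ns.firstClickMillis = e.timestampMillis;
                ns.lastClickMillis = e.timestampMillis;
                return ns;
            });
            s.pageViews++;
            s.firstClickMillis = Math.min(s.firstClickMillis, e.timestampMillis);
            s.lastClickMillis = Math.max(s.lastClickMillis, e.timestampMillis);
        }
        return new ArrayList<>(bySession.values());
    }
}

A handful of summary rows per day, rather than millions of raw clicks, is what actually needs to travel the hierarchy of storage just described.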

Performance

Performance in the Web e-business environment is a funny thing. Performance is vital to the success of the Web-based e-business environment because in the Web-based e-business environment the Web IS the store. In other words, when there is a performance problem in the Web-based e-business environment, there is no place to hide. The first person to know about the performance problem is the Internet-based user, and that is the last person you want noticing performance problems. In a sense the Web-based e-business environment is a naked environment: if there is a performance problem, there is no place to hide. The effect of poor performance in the e-business environment is immediate and always negative. For these reasons it behooves the Web designer to pay special attention to the performance characteristics of the applications run under the Web.

Performance in the Web environment is achieved in many ways. Certainly the careful management of large volumes of data, as previously discussed, has its own salubrious effect on performance. But there are many other ways that performance is achieved in the Web environment. One of the most important is the interface to the corporate information factory. The primary interface to the corporate information factory is through the corporate ODS. Figure 3 shows this interface.

Figure 3: The interface from the data warehouse environment to the Web environment is by way of the corporate ODS.

It is never appropriate for the Web to access the data warehouse directly. Instead, access to corporate data is done through the ODS, which is a structure and a technology designed for high-performance access of data. As important as access to corporate data is to performance, there is a whole host of other design practices that must be adhered to:

- breaking down large transactions into a series of smaller transactions
- minimizing the I/Os needed for processing
- summarizing/aggregating data
- physically grouping data together, and so forth.


Integration

The website can be built in isolation from corporate data. But when the Web is built in isolation, no business integration can occur. The key structures, the descriptions of data, the common processing - all of these issues mandate, from a business perspective, that the Web not be built in isolation from other corporate systems. The corporate information factory supports this notion by allowing the data from the Web environment to be entered into the data warehouse and integrated with other data from the corporate environment. Figure 4 shows this intermixing of data.


Figure 4: Corporate data is integrated with the Web data when they meet inside the data warehouse.

Figure 4 shows that data from the Web passes into the data warehouse. If the data coming from the Web has used common key structures and definitions of data, then the granularity manager has a simple job to do. But if the Web designer has used unique conventions and structures for the Web environment, then it is the job of the granularity manager to convert and integrate the Web data into a common corporate format and structure.

The focus of the data warehouse is to collect only integrated data. When the data warehouse is used as a garbage dump for unintegrated data, the purpose of the warehouse is defeated. Instead, it is mandatory that all data - from the Web or otherwise - be integrated into the data warehouse.

Addressing the Issues

There are some major issues facing the Web designer: the volumes of data created by Web processing, the performance of the Web environment, and the need for integration of Web data with other corporate data. The juxtaposition of the Web environment to the corporate information factory allows those issues to be addressed.


The Importance of Data Quality Strategy

CHAPTER

11 Develop a Data Quality Strategy Before Implementing a Data Warehouse

The importance of data quality with respect to the strategic planning of any organization cannot be stressed enough. The Data Warehousing Institute (TDWI), in a recent report, estimates that data quality problems currently cost U.S. businesses $600 billion each year. Time and time again, however, people claim that they can't justify the expense of a Data Quality Strategy. Others simply do not acknowledge the benefits.

While a data quality strategy is important, it takes on new significance when implementing a data warehouse. The effectiveness of a data warehouse depends on the quality of its data, and the data warehouse itself does not do a satisfactory job of cleansing data: the same data would need to be cleansed repeatedly during iterative operations. The best place to cleanse data is in production, before loading it into the data warehouse. By cleansing data in production instead of in the data warehouse, organizations save time and money.

Data Quality Problems in the Real World

The July 1, 2002 edition of the USA Today newspaper ran an article entitled "Spelling Slows War on Terror." It demonstrates how hazardous data (poor data quality) can hurt an organization. In this case, the organization is the USA, and we are all partners. The article cites confusion over the appropriate spelling of Arab names and links this confusion to the difficulty U.S. intelligence experiences in tracking these suspects. The names of individuals, their aliases, and their alternative spellings are captured by databases from the FBI, CIA, Immigration and Naturalization Service (INS), and other agencies.

Figure 1: Data flow in government agencies.

Figure 1 clearly shows that the data flow between organizations is truly nonexistent. A simple search for names containing "Gadhafi" returns entirely different responses from each data source.

Why Data Quality Problems Go Unresolved

Problems with data quality are not unique to government; no organization, public or private, is immune to this problem. Excuses for doing nothing about it are plentiful:

- It costs too much to replace the existing systems with data-sharing capability.
- We could build interfaces into the existing systems, but no one really understands the existing data architectures of the systems involved.
- How could we possibly build a parser with the intelligence to perform pattern recognition for resolving aliases, let alone misspellings and misidentifications?
- There is simply no way of projecting return on investment for an investment such as this.

Quite similarly, the USA Today article cited the following three problems, identified publicly by the FBI and privately by CIA and INS officials:

- Conflicting methods are used by agencies to translate and spell the same name.
- Antiquated computer software at some agencies won't allow searches for approximate spellings of names.
- Common Arabic names such as Muhammed, Sheik, Atef, Atta, al-Haji, and al-Ghamdi add to the confusion (i.e., multiple people share the same name, such as "John Doe").

Note the similarity of these two lists.

Fraudulent Data Quality Problems

To further complicate matters, a recent New York Times article published on July 10, 2002 confirmed that at least 35 bank accounts had been acquired by the September 11, 2001 hijackers during the prior 18 months. The hijackers used stolen or fraudulent data such as names, addresses, and social security numbers.


The Seriousness of Data Quality Problems

It can be argued that, in most cases, the people being tracked are relative unknowns. Unfortunately, the problem is not confined to unknowns. In fact, a CIA official conducting a search on Libyan leader Moammar Gadhafi found more than 60 alternate spellings of his name. Some of the alternate spellings are listed in Table 1.

ALTERNATE SPELLINGS OF LIBYAN LEADER'S SURNAME
1. Qadhafi
2. Qaddafi
3. Qatafi
4. Quathafi
5. Kadafi
6. Kaddafi
7. Khadaffi
8. Gadhafi
9. Gaddafi
10. Gadafy

Table 1: Alternate spellings of Libyan leader's surname.

In this example, we are talking about someone who is believed to have supported terrorist-related activities and is the leader of an entire country, yet we still cannot properly identify him. Note that this example was obtained through the sampling of CIA data only; imagine how many more alternate spellings of Gadhafi one would find upon integrating FBI, INS, and other sources. Fortunately, most of us are not trying to save the world, but data quality might save our business!
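To see why this is a technical problem and not merely a clerical one, consider the toy sketch below (Java). It is emphatically not any agency's matching algorithm; it simply shows that an exact-match search treats every spelling in Table 1 as a different person, while even a crude normalization rule can collapse several of the common variants into one searchable form.

// Toy illustration of why exact matching fails on variant spellings and how
// a crude normalization can group them. Real systems would use far more
// sophisticated phonetic and pattern-matching techniques.
public class NameNormalizer {

    // Collapse a few common transliteration variants into one canonical form.
    public static String normalize(String name) {
        return name.toLowerCase()
                .replaceAll("kh|q|k|g", "g")     // Qadhafi, Kadafi, Gadhafi -> g...
                .replaceAll("dd|dh|th|t|d", "d") // ...ddafi, ...dhafi, ...thafi -> ...dafi
                .replaceAll("ff", "f")
                .replaceAll("y$", "i")
                .replaceAll("[^a-z]", "");
    }

    public static void main(String[] args) {
        String[] variants = {"Qadhafi", "Kaddafi", "Gadhafi", "Gadafy"};
        for (String v : variants) {
            // All four of these variants print "gadafi" after normalization.
            System.out.println(v + " -> " + normalize(v));
        }
    }
}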


Data Collection

Whether you're selling freedom or widgets, whether you service tanks or SUVs, you have been collecting data for a long time. Most of this data has been collected in an operational context, and the operational life span of data (approximately 90 days) is typically far shorter than its analytical life span (endless). This is a lot of data, with a lot of possibilities for quality issues to arise. Chances are high that you have data quality issues that need to be resolved before you load data into your data warehouse.

Solutions for Data Quality Issues

Fortunately, there are multiple options available for solving data quality problems. We will describe three specific options here:

- Build an integrated data repository.
- Build translation and validation rules into the data-collecting application.
- Defer validation until a later time.

Option 1: Integrated Data Warehouse

The first and most unobtrusive option is to build a data warehouse that integrates the various data sources, as reflected in the center of Figure 2.


Figure 2: Integrated data warehouse.

An agreed-upon method for translating the spellings of names would be universally applied to all data supplied to the Integrated Data Warehouse, regardless of its source. Extensive pattern recognition search capability would be provided to search for similar names that may prove to be aliases in certain cases.

The drawback here is that the availability of quality data is delayed. It takes time for each source to collect its data and then submit it to the repository, where it can be integrated. The cost of this integration time frame will differ depending on the industry you are involved in. Clearly, freedom fighters need high quality data on very short notice. Most of us can probably wait a day or so to find out whether John Smith has become our millionth customer, or whatever the inquiry may be.


Option 2: Value Rules

In many cases, we can afford to build our translation and validation rules into the applications that initially collect the data. The obvious benefit of such an approach is the expediency of access to high quality data. In this case, the agreed-upon method for translating data is centrally constructed and shared by each data collection source. These rules are applied at the point of data collection, eliminating the translation step of passing data to the Integrated Data Warehouse. This approach does not alleviate the need for a data warehouse, and there will still be integration rules to support, but improving the quality of data at the point it is collected considerably increases the likelihood that this data will be used more effectively over a longer period of time.
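A minimal sketch of such a shared rule is shown below (Java; the class name and the canonical-spelling table are hypothetical illustrations). The point is simply that the translation logic lives in one centrally maintained place and is applied before the record is ever stored by any collecting application.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a shared translation rule applied at the point of
// data collection, so every source system records the agreed-upon spelling.
public class NameTranslationRule {

    // Centrally constructed and shared by every data-collecting application.
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put("qadhafi", "Gadhafi");
        CANONICAL.put("kaddafi", "Gadhafi");
        CANONICAL.put("gadafy", "Gadhafi");
    }

    // Return the agreed-upon spelling, or the input unchanged if no rule exists.
    public static String apply(String enteredName) {
        String key = enteredName.trim().toLowerCase();
        return CANONICAL.getOrDefault(key, enteredName.trim());
    }
}

With such a rule in place, a collecting application would call NameTranslationRule.apply() on the entered value and store the agreed-upon spelling rather than whatever the clerk happened to type.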

Option 3: Deferred Validation

Of course, there are circumstances where this level of validation simply cannot be levied at the point of data collection. For example, an online retail organization will not want to turn away orders upon receipt because the address isn't in the right format. In such circumstances, a set of deferred validation routines may be the best approach. Validation still happens in the systems where the data is initially collected, but does not interfere with the business cycle.

Periodic sampling averts future disasters

The obvious theme of this article is to develop thorough data rules and implement them as close to the point of data collection as feasible to ensure an expected level of data quality.


But what happens when a new anomaly crops up? How will we know if it is slowly or quickly becoming a major problem? There are many examples to follow. Take the EPA, which has installed monitors of various shapes and sizes across the continental U.S. and beyond. The monitors take periodic samples of air and water quality and compare the results to previously agreed-upon benchmarks. This approach proactively alerts the appropriate personnel when an issue arises, and it can assess the acceleration of the problem to indicate how rapidly a response is needed.

We too must identify the most critical data elements in the sources we manage and develop data quality monitors that periodically sample the data and track quality levels. These monitors are also good indicators of system stability, having been known to identify when a given system component is not functioning properly. For example, I've seen retail environments where the technology was not particularly stable and caused orders to be held in a Pending status for days. A data quality monitor tracking orders by status would detect this phenomenon early, track its adverse effect, and notify the appropriate personnel when the pre-stated threshold has been reached.

Data quality monitors can also be good business indicators. Being able to publish statistics on the number of unfulfilled orders due to invalid addresses, or on the point in the checkout process at which most customers cancel orders, can indicate places where processes can be improved.
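The sketch below illustrates the idea of such a monitor (Java; the order structure, threshold values, and alerting mechanism are all hypothetical stand-ins). It samples orders on a schedule, counts those stuck in Pending status beyond an agreed limit, and raises an alert when a pre-stated threshold is crossed.

import java.util.List;

// Illustrative data quality monitor: periodically sample orders, count how
// many have sat in Pending status longer than an agreed number of days, and
// alert when the count crosses a pre-stated threshold.
public class PendingOrderMonitor {

    public static class Order {
        String status;
        long daysInStatus;
    }

    private final long maxPendingDays;
    private final long alertThreshold;

    public PendingOrderMonitor(long maxPendingDays, long alertThreshold) {
        this.maxPendingDays = maxPendingDays;
        this.alertThreshold = alertThreshold;
    }

    // Run once per sampling interval (hourly, nightly, etc.).
    public void sample(List<Order> orders) {
        long stuck = orders.stream()
                .filter(o -> "PENDING".equals(o.status) && o.daysInStatus > maxPendingDays)
                .count();
        if (stuck > alertThreshold) {
            // In a real environment this would page or e-mail the appropriate personnel.
            System.out.println("ALERT: " + stuck + " orders pending more than "
                    + maxPendingDays + " days");
        }
    }
}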


Conclusion

A sound Data Quality Strategy can be developed in a relatively short period of time. However, this is no more than a framework for how the work is to be carried out. Do not be mistaken - commitment to data quality cannot be taken lightly. It is a mode of operation that must be fully supported by business and technology alike.


Data Modeling and eBusiness

CHAPTER

12 Data Modeling for the Data Warehouse

In order to be effective, data warehouse developers need to show tangible results quickly. At the same time, in order to build a data warehouse properly, you need a data model. And everyone knows that data models take huge amounts of time to build. How then can you say in the same breath that a data model is needed in order to build a data warehouse and that a data warehouse should be built quickly? Aren't those two statements completely contradictory? The answer is -- not at all. Both statements are true, and they do not contradict each other once you understand the dynamics that are at work.

"Just the Facts, Ma'am" Consider several facts.

Fact 1 -- When you build a data model for a data warehouse, you build a data model for only the primitive data of the corporation. Figure 1 suggests a data model for the primitive data of the data warehouse.


Figure 1: Data model for a data warehouse.

Modeling Atomic Data

The first reaction most people have to data modeling is that the data model for a data warehouse must contain every permutation of data possible because, after all, doesn't the data warehouse serve the entire enterprise? The answer is that the data in the warehouse does indeed serve the entire corporation. But the data found in the data warehouse is at the most atomic level of data that there is. The different summarizations and aggregations of data found in the enterprise -- all the permutations -- are found outside the data warehouse in data marts, DSS applications, the ODS, and so forth. Those summarizations and aggregations of data that are constantly changing are not found at the atomic, detailed level of data in the data warehouse. The different permutations and summarizations of data that are inherent to informational processing are found in the many other parts of the corporate information factory, such as the data marts, the DSS applications, the exploration warehouse, and so forth.

Because the data warehouse contains only the atomic data of the corporation, the data model for the data warehouse is very finite. For example, a data warehouse typically contains basic transaction information:

- item sold -- bolts, amount -- 12.98, place -- Long's Drug, Sacramento, shipment -- carry, unit -- by item, part number -- aq4450-p
- item sold -- string, amount -- 10.00, place -- Safeway, Dallas, shipment -- courier, unit -- by ball, part number -- su887-p1
- item sold -- plating, amount -- 1090.34, place -- Emporium, Baton Rouge, shipment -- truck, unit -- by sq ft, part number -- pl9938-re6
- item sold -- mount, amount -- 10000.00, place -- Ace Hardware, Texarkana, shipment -- truck, unit -- by item, part number -- we887-qe8
- item sold -- bolts, amount -- 122.67, place -- Walgreens, El Paso, shipment -- train, unit -- by item, part number -- aq4450-p
- ...

The transaction information is very granular. The data model need only concern itself with very basic data. What is not found in the data model for the data warehouse is information such as:

- monthly revenue by part by region
- quarterly units sold by part
- weekly revenue by region
- discount ratio for bulk orders by month
- shelf life of product classifications by month by region

The data model for the data warehouse simply does not have to concern itself with these types of data. Derived data, summarized data, and aggregated data are all found outside the data warehouse. Therefore, the data model for the data found in the data warehouse does not need to specify all of these permutations of basic atomic data. Furthermore, the data found in the data warehouse is very stable; it changes only very infrequently. It is the data outside of the data warehouse that changes. This means that the data warehouse data model is not only small but stable.

Through Data Attributes, Many Classes of Subject Areas Are Accumulated

Fact 2 -- The data attributes found in the data warehouse should include information so that the subjects described can be interpreted as broadly as possible. In other words, the atomic data found in the data warehouse should be as far-reaching and as widely representative of as many classes and categories of data as possible. For example, suppose the data modeler has the subject area CUSTOMER. Is the data model for customer supposed to describe an existing customer? A past customer? A future customer? The answer is that the subject area CUSTOMER -- if modeled properly at the data warehouse level -- should include attributes representative of ALL types of customers, not just one type. Attributes should be placed in the subject data model so that:

- the date a person became a customer is noted
- the date a person was last a customer is noted
- whether the person was ever a customer is noted.

By placing all the attributes that might be needed to determine the classification of a customer in the subject area of the data model, the data modeler has prepared for future contingencies in the data. Ultimately the DSS analyst doing informational processing can use the attributes found in CUSTOMER data to look at past customers, future or potential customers, and current customers. The data model prepares the way for this flexibility by placing the appropriate attributes in the atomic data of the data warehouse.

As another example of placing many attributes in atomic data, a part number might include all sorts of information about the part, whether or not the information is directly needed by current requirements. The part number might include attributes such as:

- part number
- unit of measure
- technical description
- business description
- drawing number
- part number formerly known as
- engineering specification
- bill of material into
- bill of material from
- wip or raw goods
- precious good
- store number
- replenishment category
- lead time to order
- weight
- length
- packaging
- accounting cost basis
- assembly identification

Many of these attributes may seem extraneous for much of the information processing found in production control. But by attaching these attributes to the part number in the data model, the way is paved for future, unknown DSS processing that may arise. Stated differently, the data model for the data warehouse tries to include as many classifications of data as possible and does not exclude any reasonable classification. In doing so, the data modeler sets the stage for all sorts of requirements to be satisfied by the data warehouse.

From a data model standpoint, then, the data modeler simply models the most atomic data with the widest latitude for interpretation. Such a data model can be easily created and represents the corporation's most basic, simplest data. Once defined this way in the data model, the data warehouse is prepared to handle many requirements, some known, some unknown. For these reasons, creating the data model for the data warehouse is not a horribly laborious task, given the parameters of modeling only atomic data and putting in attributes that allow the atomic data to be stretched any way desired.


Other Possibilities -- Generic Data Models

But who said that the data model had to be created at all? There is tremendous commonality across companies in the same industry, especially when it comes to the most basic, most atomic data. Insurance data is insurance data. Banking data is banking data. Railroad data is railroad data, and so forth. Why go to one company, create a data model for them, then turn around and go to a different company in the same industry and create essentially the same data model? Does this make sense? Instead, why not look for generic industry and functional data models? A generic data model is one that applies to an industry rather than to a specific company. Generic data models are most useful when applied to the atomic data found in the data warehouse. There are several good sources of generic data models -- www.billinmon.com, The Data Model Resource Book by Len Silverston, and so forth.

In some cases these generic data models are free; in other cases they cost a small amount of money. In any case, starting the development process with a prebuilt data model makes sense because it puts the modeler in the position of being an editor rather than a creator, and human beings are always naturally more comfortable editing than creating. Put a blank sheet of paper in front of a person and that person sits there and stares at it. But put something on that sheet of paper and the person immediately and naturally starts to edit. Such is human nature.


Design Continuity from One Iteration of Development to the Next

But there is another great value of the data model to the world of data warehousing: it is the data model that provides design continuity. Data warehouses -- when built by a knowledgeable developer -- are built incrementally, in iterations. First one small iteration of the data warehouse is built. Then another iteration is built, and so forth. How do these different development efforts know that the product being produced will be tightly integrated? How does one development team know that it is not stepping on the toes of another development team? The data model is how the different development teams work together -- independently -- but nevertheless on the same project, without overlap and conflict. The data model becomes the cohesive driving force in the building of a data warehouse -- the intellectual roadmap -- that holds the different development teams together.

These then are some considerations with regard to the data model for the data warehouse environment. Someone who tells you that you don't need a data model has never built a data warehouse before. And likewise, someone who tells you that the data model for the data warehouse is going to take eons of time to build has also never built a data warehouse. In truth, people who build data warehouses have data models, and they build their data warehouses in a reasonable amount of time. It happens all the time.


Don't Forget the Customer

CHAPTER

13 Interacting with the Internet Viewer

The Web-based e-business environment is supported by an infrastructure called the corporate information factory. The corporate information factory provides many capabilities for the Web, such as the ability to handle large volumes of data, deliver good and consistent performance, see both detail and summary information, and so forth. Figure 1 shows the corporate information factory infrastructure that supports the Web-based e-business environment.


Figure 1: The corporate information factory and the Web-based e-business environments


The corporate information factory also provides the means for a very important feedback loop for the Web processing environment. It is through the corporate information factory that the Web is able to "remember" who has been to the website.

Once having remembered who has been to the website, the Web analyst is able to tailor the dialogue with the consumer to best meet the consumer's needs. The ability to remember who has been to the website allows the Web analyst to greatly customize the HTML pages that the Web viewer sees, and in doing so achieve a degree of "personalization". The ability to remember who has been to a site and what they have done is at the heart of the opportunity for cross-selling, extensions to existing sales, and many other marketing opportunities. In order to see how this extended feedback loop works, it makes sense to follow a customer through the process for a few transactions. Figure 2 shows the system as a customer enters it through the Internet for the first time.

Figure 2: How an interaction enters the website.


Step 1 in Figure 2 shows that the customer has discovered the website. The customer enters through the firewall. Upon entering, the customer's cookie is identified. The Web manager asks if the cookie is known to the system at Step 2 of Figure 2. The answer comes back that the cookie is unknown, since it is the first time the customer has been to the site. The customer is then returned a standard dialogue that has been prepared for all new entrants to the website. Of course, based on the interactions with the customer, the dialogue soon is channeled in the direction desired by the customer.

The results of the dialogue - from no interaction at all to an extended and complex dialogue - are recorded along with the cookie and the date and time of the interaction. If the dialogue throws off any business transactions - such as a sale or an order - then those transactions are recorded along with the click stream information. The results of the dialogue end up in the data warehouse, as seen in Step 3 of Figure 2. The data passes through the granularity manager, where condensation occurs. In the data warehouse a detailed account of the dialogue is created for the cookie and for the date of interaction. Then, periodically, the Web analyst runs a program that reads the detailed data found in the data warehouse. This interaction is shown in Figure 3.


Figure 3: The content, structure, and keys of the corporate systems need to be used in the creation of the Web environment.

The Web analyst reads the detailed data for each cookie. In the case of the Internet viewer who has had one entry into the website, there will be only one set of detailed data reflecting the dialogue that has occurred. But if there have been multiple entries by the Internet viewer, the Web analyst considers each of them. In addition, if the Web analyst has other data available about the customer, that information is taken into consideration as well. This analysis of detailed historical data is shown in Figure 3, Step 1.

Based on the dialogues that have occurred and their recorded detailed history, the Web analyst prepares a "profile" record for each cookie. The profile record is placed in the ODS as seen in Figure 3, Step 2. The first time through, a profile record is created. Thereafter the profile record is updated. The profile record can contain anything that is of use and of interest to the sales and marketing organization. Some typical information that might be found in the profile record includes:

- cookie id
- date of last interaction
- total number of sessions
- last purchase type
- last purchase amount
- name (if known)
- address (if known)
- url (if known)
- items in basket not purchased
  - item 1
  - item 2
  - item 3
  - .....
- classification of interest
  - interest type 1
  - interest type 2
  - interest type 3
  - ...............
- buyer type

The profile record can contain anything that will be of use to the sales and marketing department in the preparation of effective dialogues. The profile record is written or updated to the ODS. The profile analysis can occur as frequently or as infrequently as desired: it can occur hourly or monthly. When profile analysis occurs monthly, there is the danger that the viewer will return to the website without the profile record being up to date. If this is the case, then the customer will appear as if he/she were a cookie unknown to the system. If profile creation is done hourly, the system will be up to date, but the overhead of doing profile analysis will be considerable. When doing frequent profile analysis, only the most recent units of information are considered in the creation and update of the profile. The tradeoff then is between the currency of data and the overhead needed for profile analysis.

Once the profile record is created in the ODS it is available for immediate usage. Figure 4 shows what happens when the Internet viewer enters the system for the second, third, fourth, etc. times.


Figure 4: Of particular interest is the granularity manager that manages the flow of data from the Web environment to the corporate information factory.

The viewer enters the system through the firewall. Figure 4, Step 1 shows this entry. Control then passes to the Web manager, and the first thing the Web manager does is determine whether the cookie is known to the system. Since this is the second (or later) dialogue for the viewer, there is a cookie record for the viewer. The Web manager goes to the ODS and finds that indeed the cookie is known to the system. This interaction is shown in Figure 4, Step 2. The profile record for the customer is returned to the Web manager in Figure 4, Step 3.

Now the Web manager has the profile record and can start to use the information in it to tailor the dialogue. Note that the amount of time needed to access the profile record is measured in milliseconds. The time to analyze the profile record and prepare a customized dialogue is even faster. In other words, the Web manager can get a complete profile of a viewer without having to blink an eye. The ability to get a profile record very, very quickly means that the profile record can be part of an interactive dialogue that occurs in sub-second time. Good performance from the perspective of the user is the result.
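A minimal sketch of what the profile record and its lookup might look like appears below (Java; the field names follow the list above, while the class names and the in-memory map standing in for the ODS are hypothetical illustrations, not part of any actual product).

import java.util.Date;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a profile record keyed by cookie id and the kind of
// fast lookup the Web manager performs against the ODS. The in-memory map
// simply stands in for the ODS here.
public class ProfileStore {

    public static class ProfileRecord {
        String cookieId;
        Date lastInteraction;
        int totalSessions;
        String lastPurchaseType;
        double lastPurchaseAmount;
        List<String> itemsInBasketNotPurchased;
        List<String> interestTypes;
        String buyerType;
    }

    private final Map<String, ProfileRecord> ods = new ConcurrentHashMap<>();

    // Millisecond-class lookup: is this cookie known, and if so what is its profile?
    public ProfileRecord lookup(String cookieId) {
        return ods.get(cookieId); // null means "unknown cookie" -> standard dialogue
    }

    // Written or updated by the Web analyst's periodic profile analysis.
    public void upsert(ProfileRecord record) {
        ods.put(record.cookieId, record);
    }
}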

IN SUMMARY

The feedback loop that has been described fulfills the needs of dialogue management in the Internet environment. The feedback loop allows:

- each customer's records to be analyzed at the detailed level
- access to summary and aggregate information to be made in subsecond time
- records to be created each time new information is available, and so forth.


Getting Smart

CHAPTER

14 Elasticity and Pricing: Getting Smart

The real potency of the eBusiness environment is opened up when the messages sent across the eBusiness environment start to get smart. And just how is it that a company can get smart about the messages it sends? One of the most basic approaches to getting smart is pricing your products right. Simply stated, if you price your products too high, you don't sell enough units to maximize profitability. If you price your products too low, you sell a lot of units, but you leave money on the table. So the smart business prices its products just right.

And how exactly are products priced just right? The genesis of pricing products just right is the integrated historical data that resides in the data warehouse. The data warehouse contains a huge amount of useful sales data. Each sales transaction is recorded in the data warehouse.

Historically Speaking

By looking at the past sales history of an item, the analyst can start to get a feel for the price elasticity of the item. Price elasticity refers to the sensitivity of sales to the price of the product. Some products sell well regardless of their price, and other products are very sensitive to pricing. Some products sell well when the price is low but sell poorly when the price is high.


Consider the different price elasticity of two common products - milk and bicycles.

MILK PRICE
$2.25/gallon  560 units sold
$2.15/gallon  585 units sold
$1.95/gallon  565 units sold
$1.85/gallon  590 units sold
$1.75/gallon  575 units sold
$1.65/gallon  590 units sold

BICYCLES
$400  16 units sold
$390  15 units sold
$380  19 units sold
$370  21 units sold
$360  20 units sold
$350  23 units sold
$340  24 units sold
$330  26 units sold
$320  38 units sold
$310  47 units sold
$300  59 units sold
$290  78 units sold

Milk is going to sell regardless of its price. (Actually this is not true; at some price, say $100 per gallon, even milk stops selling.) But within the range of reasonable prices, milk is price inelastic. Bicycles are another matter altogether. When bicycles are priced low, they sell a lot. But the more the price is raised, the fewer units are sold. By looking at past sales, the business analyst starts to get a feel for the price elasticity of a given product.


At the Price Breaking Point

But price elasticity is not the only important piece of information that can be gleaned from looking at past sales information. Another important piece of information that can be gathered is the "price break" point for a product. For those products that are price elastic, there is a point at which the maximum number of units will be sold. This price point is the equivalent of the economic order quantity (the "EOQ"). The price break point can be called the economic sale price (the "ESP"). The economic sale price is the point at which no more marginal sales will be made regardless of further lowering of the price. In order to find the ESP, consider the following sale prices for a washing machine:

WASHING MACHINE
$500  20 units
$475  22 units
$450  23 units
$425  20 units
$400  175 units
$375  180 units
$350  195 units
$325  200 units
$300  210 units
$275  224 units

In this simple (and somewhat contrived) example, the ESP is clearly at $400. If the merchant prices the item above $400, then the merchant is selling fewer units than is optimal. If the merchant prices the item lower than $400, then the merchant will move a few more items, but not many more. The merchant is in essence leaving money on the table by pricing the item lower than $400. If the price/unit points were graphed, there would be a knee in the curve at $400, and that is where the ESP is located. Stated differently, the merchant will move the largest number of units at the highest price by discovering the ESP. It doesn't take a genius to see that finding which items are elastic, and finding the ESP of those items, is the equivalent of printing money as far as the merchant is concerned.
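The text does not prescribe an algorithm for locating the knee, but one simple approach is to look for the price step with the largest gain in units sold per dollar of price reduction. The sketch below (Java; illustrative only, not a production pricing engine) applies that idea to the washing machine figures above and arrives at $400.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch only: estimate the economic sale price (ESP) as the "knee" in the
// price/units curve, i.e. the price step with the largest gain in units sold
// per dollar of price reduction.
public class EspEstimator {

    public static class PricePoint {
        final double price;
        final double unitsSold;
        PricePoint(double price, double unitsSold) { this.price = price; this.unitsSold = unitsSold; }
    }

    public static double estimateEsp(List<PricePoint> history) {
        history.sort(Comparator.comparingDouble((PricePoint p) -> p.price).reversed());
        double bestGainPerDollar = Double.NEGATIVE_INFINITY;
        double esp = history.get(history.size() - 1).price;
        for (int i = 1; i < history.size(); i++) {
            PricePoint higher = history.get(i - 1);
            PricePoint lower = history.get(i);
            double gain = (lower.unitsSold - higher.unitsSold) / (higher.price - lower.price);
            if (gain > bestGainPerDollar) {
                bestGainPerDollar = gain;
                esp = lower.price; // the price where the big jump in units appears
            }
        }
        return esp;
    }

    public static void main(String[] args) {
        List<PricePoint> washer = new ArrayList<>(List.of(
                new PricePoint(500, 20), new PricePoint(475, 22), new PricePoint(450, 23),
                new PricePoint(425, 20), new PricePoint(400, 175), new PricePoint(375, 180),
                new PricePoint(350, 195), new PricePoint(325, 200), new PricePoint(300, 210),
                new PricePoint(275, 224)));
        System.out.println("Estimated ESP: " + estimateEsp(washer)); // prints 400.0
    }
}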

How Good Are the Numbers

And the simple examples shown here are representative of the kinds of decisions that merchants make every day. But there are some factors that the analyst had better be willing to take into account. The first factor is the purity of the numbers. Suppose an analyst is presented with the following sales figures for a product:

$100  1 unit sold
$75   10,000 units sold
$50   2 units sold

What should the analyst make of these figures? According to theory, there should be an ESP at $75. But looking into the sales of the product, the business analyst finds that, except for one day in the life of the corporation, the product has never been on sale at anything other than $75. On the day after Christmas two items were marked down and they sold quickly. And one item was marked up by accident and just happened to sell. So these numbers have to be interpreted very carefully. Drawing the conclusion that $75 is the ESP may be completely fallacious.

To be meaningful, the sales, and the prices at which the sales were made, would need to come from a laboratory environment. In other words, when examining sales, the sales price needs to have been presented to the buying public at many different levels for an equal time in order for the ESP to be established. Unfortunately this is almost never the case. Stores are not laboratories, and products and sales are not experiments. To mitigate the fact that sales are almost never made in a laboratory manner, there are other important measurements that can be made. These measurements, which also indicate the price elasticity of an item, include stocking-to-sale time and marginal sales elasticity.

How Elastic Is the Price

The stocking-to-sale time is a good indicator of the price elasticity of an item because it indicates the demand for the item regardless of other conditions. To illustrate the stocking-to-sale time for an item, consider the following simple table:

$200  35 days
$175  34 days
$150  36 days
$125  31 days
$100  21 days
$75   20 days
$50   15 days


Note that in this example, there is no need to look at total number of items sold. Total number of items sold can vary all over the map based on the vagaries of a given store. Instead, the elasticity of the product is examined through the perspective of how pricing affects the length of time an item sits on the shelves. Realistically, this is probably a much better measurement of the elasticity of an item than total units sold, given that there are many other factors that relate to total items sold.

Another way to address the elasticity of an item is through marginal units of sale per unit drop in price. Suppose that a merchant does not have a wealth of data to examine, and suppose that a merchant does not have a laboratory in which to do experiments on pricing (both reasonable assumptions in the real world). What the merchant can do is keep careful track of the sales of an item at one price, then drop the price and keep track of the sales at the new price. In doing so, the merchant can get a good feel for the elasticity of an item without having massive figures stored over the years. For example, consider two products - product A and product B. The following sales patterns are noted for the two products:

Product A
$250  20,000 units sold
$200  25,000 units sold

Product B
$100  5,000 units sold
$90   5,010 units sold

Based on these two simple measurements, the merchant can draw the conclusion that Product A is price elastic and that Product B is price inelastic. The merchant does not need a laboratory or more elaborate measurements to determine the elasticity of the products. In other words, these simple measurements can be done in a real-world environment. And exactly where does the merchant get the numbers for elasticity analysis? The answer, of course, is the data warehouse. The data warehouse contains detailed, integrated, historical data, which is of course exactly what the business analyst needs to effect these analyses.
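The text does not give a formula for this judgment, but the standard percentage-change ratio is one simple way to quantify it. The sketch below (Java; illustrative only) applies it to products A and B: a ratio above 1 suggests an elastic product, below 1 an inelastic one.

// Sketch only: classify a product as price elastic or inelastic from two
// observed (price, units) points using the percentage-change ratio.
public class ElasticityCheck {

    // Magnitude of price elasticity of demand between two observations.
    public static double elasticity(double price1, double units1, double price2, double units2) {
        double pctUnitsChange = (units2 - units1) / units1;
        double pctPriceChange = (price2 - price1) / price1;
        return Math.abs(pctUnitsChange / pctPriceChange);
    }

    public static void main(String[] args) {
        // Product A: $250 -> 20,000 units, $200 -> 25,000 units (prints 1.25: elastic)
        System.out.println("Product A: " + elasticity(250, 20000, 200, 25000));
        // Product B: $100 -> 5,000 units, $90 -> 5,010 units (prints 0.02: inelastic)
        System.out.println("Product B: " + elasticity(100, 5000, 90, 5010));
    }
}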

Conclusion

Once the price elasticity of items is known, the merchant knows just how to price the item. And once the merchant knows exactly how to price an item, the merchant is positioned to make money. The Web and eBusiness are now positioned to absolutely maximize sales and revenue. However, note that if products are not priced properly, the Web accelerates the rate at which the merchant loses money. This is what is meant by being smart about the message you put out on the Web. The Web accelerates everything. It either allows you to make money faster than ever before or lose money faster than ever before. Whether you make or lose money depends entirely on how smart you are about what goes out over the Web.


Tools of the Trade: Java

CHAPTER

15 The eDBA and Java

Welcome to another installment of our eDBA column, where we explore and investigate the skills required of DBAs as their companies move from traditional business models to become e-businesses. Many new technologies will be encountered by organizations as they morph into e-businesses. Some of these technologies are obvious, such as connectivity, networking, and basic web skills. But some are brand new and will impact the way in which an eDBA performs her job. In this column and next month's column I will discuss two of these new technologies and the impact of each on the eDBA. This month we discuss Java; next time, XML. Neither of these columns will provide an in-depth tutorial on the subject. Instead, I will provide an introduction for those new to the topic, and then describe why an eDBA will need to know about it and how it will impact their job.

What is Java?

Java is an object-oriented programming language. Originally developed by Sun Microsystems, Java was modeled after, and most closely resembles, C++. But it requires a smaller footprint and eliminates some of the more complex features of C++ (e.g., pointer management). The predominant benefit of the Java programming language is portability. It enables developers to write a program once and run it on any platform, regardless of hardware or operating system.


An additional capability of Java is its suitability for enabling animation for, and interaction with, web pages. Using HTML, developers can run Java programs, called applets, over the web. But Java is a completely different language from HTML, and it does not replace HTML. Java applets are automatically downloaded and executed by users as they surf the web. But keep in mind that even though web interaction is one of its most touted features, Java is a fully functional programming language that can be used for developing general-purpose programs, independent of the web.

What makes Java special is its multi-platform design. In theory, regardless of the actual machine and operating system that you are using, a Java program should be able to run on it. Many possible benefits accrue because Java enables developers to write an application once and then distribute it to be run on any platform. These benefits can include reduced development and maintenance costs, lower systems management costs, and more flexible hardware and software configurations.

So, to summarize, the major qualities of Java are:

- its similarity to other popular languages
- its ability to enable web interaction
- its ability to enable executable web content
- its ability to run on multiple platforms

Why is Java Important to an eDBA?

As your organization moves to the web, Java will gain popularity. Indeed, the growth of Java usage in recent years almost mirrors the growth of e-business (see Figure 1).


Figure 1: The Java Software Market (in US$) - Source: IDC

So Java will be used to write web applications. And those web applications will need to access data, which is invariably stored in a relational database. And, as DBAs, we know that when programs meet data, that is when most performance problems are introduced. So, if Java is used to develop web-based applications that access relational data, eDBAs will need to understand Java.

There is another reason why Java is a popular choice for web-based applications: Java can enhance application availability. And, as we learned in our previous column, availability is of paramount importance to web-based applications.

How can Java improve availability?

Java is a late-binding language. After a Java program is developed, it is compiled. But the compiler output is not pure executable code. Instead, the compiler produces Java bytecodes. This is what enables Java to be so portable from platform to platform. The Java bytecodes are interpreted by a Java Virtual Machine (JVM), and each platform has its own JVM.

The availability aspect comes into play based on how code changes are introduced. Java code changes can be deployed as components while the application is running, so you do not need to stop the application in order to introduce code changes. The code changes can be downloaded over the web as needed. In this way, Java can enhance availability. Additionally, Java simplifies complicated turnover procedures and the distribution and management of DLL files required of client/server applications.

How Will Java Impact the Job of the eDBA?

One of the traditional roles of the DBA is to monitor and manage the performance of database access. With Java, performance can be a problem. Remember that Java is interpreted at run time. A Java program, therefore, is usually slower than an equivalent traditional, compiled program.

Just In Time (JIT) compiler technology is available to enable Java to run faster. Using a JIT compiler, bytecodes are compiled into machine language just before they are executed on the platform of choice. This can enhance the performance of a Java program. But a JIT compiler does not deliver the speed of a fully compiled program: the process is still partly interpretive, and performance may still be a problem. Another approach is a High Performance Java (HPJ) compiler. The HPJ compiler turns bytecodes into true load modules, avoiding the overhead of interpreting Java bytecodes at runtime. But not all Java implementations support JIT or HPJ compilers.


As an eDBA, you need to be aware of the different compilation options and provide guidelines for the development staff as to which to use, based on the availability of the technology, the performance requirements of the application, and the suitability of each technique to your shop.

Additionally, eDBAs will need to know how to access databases using Java. There are two options:

- JDBC
- SQLJ

JDBC is an API that enables Java to access relational databases. Similar to ODBC, JDBC consists of a set of classes and interfaces that can be used to access relational data. Anyone familiar with application programming and ODBC (or any call-level interface) can get up and running with JDBC fairly quickly. JDBC provides dynamic SQL access to relational databases. The intended benefit of JDBC is to provide vendor-independent connections to relational databases from Java programs. Using JDBC, theoretically at least, you should be able to write an application for one platform, say DB2 for OS/390, and deploy it on other platforms, for example, Oracle8i on Sun Solaris. Simply by using the correct JDBC drivers for the database platform, the application should be portable. Of course, this is in theory. In the real world you need to make sure you do not use any platform-specific extensions or code for this to work.
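A minimal JDBC sketch is shown below. The connection URL, credentials, table, and column names are placeholders to be replaced for your own DBMS and driver; only the JDBC calls themselves (DriverManager, PreparedStatement, ResultSet) are the real API.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal JDBC sketch: dynamic SQL issued from Java through a vendor-supplied
// driver. URL, credentials, and the customer table are hypothetical.
public class CustomerLookup {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:your_vendor://dbhost:1234/salesdb"; // placeholder URL
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT last_name, total_orders FROM customer WHERE customer_id = ?")) {
            ps.setInt(1, 1001);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("last_name") + " "
                            + rs.getInt("total_orders"));
                }
            }
        }
    }
}

Swapping the driver and connection string is, in theory, all that is needed to point the same program at a different RDBMS, which is the portability claim discussed above.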


SQLJ provides embedded static SQL for Java. With SQLJ, a translator must process the Java program. For those of you who are DB2 literate, this is just like precompiling a COBOL program. All database vendors plan to use the same generic translator. The translator strips out the SQL from the Java code

so that it can be optimized into a database request module. It also adds Java code to the Java program, replacing the SQL calls. Now the entire program can be compiled into bytecodes, and a bind can be run to create a package for the SQL. So which should you use? The answer, of course, is "it depends!" SQLJ has a couple of advantages over JDBC. The first advantage is the potential performance gain that can be achieved using static SQL. This is important for Java because Java has a reputation for being slow. So if the SQL can be optimized prior to runtime, the overall performance of the program should be improved. Additionally, SQLJ is similar to the embedded SQL programs. If your shop uses embedded SQL to access DB2, for example, then SQLJ will be more familiar to your programmers than JDBC. This familiarity could make it easier to train developers to be proficient in SQLJ than in JDBC. However, you can not use SQLJ to write dynamic SQL. This can be a drawback if you desire the flexibility of dynamic SQL. However, you can use both SQLJ and JDBC calls inside of a single program. Additionally, if your shop uses ODBC for developing programs that access Oracle, for example, then JDBC will be more familiar to your developers than SQLJ. One final issue for eDBAs confronted with Java at their shop: you will need to have at least a rudimentary understanding of how to read Java code. Most DBAs, at some point in their career, get involved in application tuning, debugging, or designing. Some wise organizations make sure that all application code is submitted to a DBA Design Review process before it is promoted to production status. The design review is performed to make sure that the code is efficient, effective, and properly coded. We all know that application and SQL is the


One final issue for eDBAs confronted with Java at their shop: you will need at least a rudimentary understanding of how to read Java code. Most DBAs, at some point in their career, get involved in application tuning, debugging, or designing. Some wise organizations make sure that all application code is submitted to a DBA Design Review process before it is promoted to production status. The design review is performed to make sure that the code is efficient, effective, and properly coded. We all know that application code and SQL are the single biggest cause of poor relational performance. In fact, most experts agree that 70% to 80% of poor "relational" performance is caused by poorly written SQL and application logic. So reviewing programs before they are moved to production status is a smart thing to do. Now, if the code is written in Java, and you, as a DBA, do not understand Java, how will you ever be able to provide expert analysis of the code during the review process? And even if you do not conduct DBA Design Reviews, how will you be able to tune the application if you do not at least understand the basics of the code? The answer is -- you cannot! So plan on obtaining a basic education in the structure and syntax of Java. You will not need Java knowledge at an expert coding level, but at an introductory level, so you can read and understand the Java code.
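
You do not need to write Java at this level yourself, but you should be able to recognize patterns like the ones in the following purely hypothetical fragment -- the kind of thing a design review ought to catch. The first method parses a brand-new dynamic statement for every customer; the second prepares the statement once and reuses it with a bind variable:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class CustomerStatusCheck {
    // Inefficient: a new dynamic statement is built and parsed per customer.
    static void slowVersion(Connection con, int[] customerIds) throws SQLException {
        for (int i = 0; i < customerIds.length; i++) {
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery(
                "SELECT status FROM customer WHERE cust_id = " + customerIds[i]);
            while (rs.next()) {
                System.out.println(customerIds[i] + ": " + rs.getString("status"));
            }
            rs.close();
            stmt.close();
        }
    }

    // Better: prepare the statement once and reuse it with a bind variable.
    static void fasterVersion(Connection con, int[] customerIds) throws SQLException {
        PreparedStatement ps = con.prepareStatement(
            "SELECT status FROM customer WHERE cust_id = ?");
        for (int i = 0; i < customerIds.length; i++) {
            ps.setInt(1, customerIds[i]);
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println(customerIds[i] + ": " + rs.getString("status"));
            }
            rs.close();
        }
        ps.close();
    }
}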

Resistance is Futile

Well, you might argue that portability is not important. I can hear you saying "I've never written a program for DB2 on the mainframe and then decided, oh, I think I'd rather run this over on our RS/6000 using Informix on AIX." Well, you have a point. Portability is a nice-to-have feature for most organizations, not a mandatory one. The portability of Java code helps software vendors more than IT shops. But if software vendors can reduce cost, perhaps your software budget will decrease. Well, you can dream, can't you?

Another issue that makes portability difficult is SQL itself. If you want to move an application program from one database platform to another, you will usually need to tweak or re-code the SQL statements to ensure efficient performance. Each RDBMS has quirks and features not supported by the other RDBMS products.

But do not get bogged down thinking about Java in terms of portability alone. Java provides more benefit than mere portability. Remember, it is easier to use than other languages, helps to promote application availability, and eases web development. In my opinion, resisting the Java bandwagon is futile at this point.

Conclusion

Since Java is clearly a part of the future of e-business, eDBAs will need to understand the benefits of Java. But, clearly, that will not be enough for success. You also will need a technological understanding of how Java works and how relational data can be accessed efficiently and effectively using Java. Beginning to learn Java today is a smart move -- one that will pay off in the long-term, or perhaps near-term, future!

And remember, this column is your column, too! Please feel free to e-mail us with any burning e-business issues you are experiencing in your shop and I'll try to discuss them in a future column. And please share your successes and failures along the way to becoming an eDBA. By sharing our knowledge we make our jobs easier and our lives simpler.


Tools of the Trade: XML

CHAPTER 16

New Technologies of the eDBA: XML

This is the third installment of my regular eDBA column, in which we explore and investigate the skills required of DBAs to support the data management needs of an e-business. As organizations move from a traditional business model to an e-business model, they will also introduce many new technologies. Some of these technologies, such as connectivity, networking, and basic Web skills, are obvious. But some are brand new and will impact the way in which eDBAs perform their jobs. In the last eDBA column I discussed one new technology: Java. In this edition we will examine another new technology: XML. The intent here is not to deliver an in-depth tutorial on the subject, but to introduce the subject and describe why an eDBA will need to know XML and how it will impact their job.

What is XML?

XML is getting a lot of publicity these days. If you believe everything you read, then XML is going to solve all of our interoperability problems, completely replace SQL, and possibly even deliver world peace. In reality, all of the previous assertions about XML are untrue.


XML stands for eXtensible Markup Language. Like HTML, XML is based upon SGML (Standard Generalized Markup Language). HTML uses tags to describe how data appears on a Web page. But XML uses tags to describe the data itself. XML retains the key SGML advantage of self-description, while avoiding the complexity of full-blown SGML. XML allows users to define tags that describe the data in the document. This capability gives users a means for describing the structure and nature of the data in the document. In essence, the document becomes self-describing. The simple syntax of XML makes it easy to process by machine while remaining understandable to people.

Once again, let's use HTML as a metaphor to help us understand XML. HTML uses tags to describe the appearance of data on a page. For example, the tag <b>text</b> would specify that the "text" data should appear in bold face. XML uses tags to describe the data itself, instead of its appearance. For example, consider the following XML describing a customer address:

<CUSTOMER>
  <first_name>Craig</first_name>
  <middle_initial>S.</middle_initial>
  <last_name>Mullins</last_name>
  <company_name>BMC Software, Inc.</company_name>
  <street_address>2101 CityWest Blvd.</street_address>
  <city>Houston</city>
  <state>TX</state>
  <zip_code>77042</zip_code>
  <country>U.S.A.</country>
</CUSTOMER>

XML is actually a meta language for defining other markup languages. These languages are collected in dictionaries called Document Type Definitions (DTDs). The DTD stores definitions of tags for specific industries or fields of knowledge. So, the meaning of a tag must be defined in a document type definition (DTD), such as:


<!DOCTYPE CUSTOMER [
<!ELEMENT CUSTOMER (first_name, middle_initial, last_name,
  company_name, street_address, city, state, zip_code, country*)>
<!ELEMENT first_name (#PCDATA)>
<!ELEMENT middle_initial (#PCDATA)>
<!ELEMENT last_name (#PCDATA)>
<!ELEMENT company_name (#PCDATA)>
<!ELEMENT street_address (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT zip_code (#PCDATA)>
<!ELEMENT country (#PCDATA)>
]>

The DTD for an XML document can either be part of the document or stored in an external file. The XML code samples shown are meant to be examples only. By examining them, you can quickly see how the document itself describes its contents. For data management professionals, this is a plus because it eliminates the trouble of tracking down the meaning of data elements. One of the biggest problems associated with database management and processing is finding and maintaining the meaning of stored data. If the data can be stored in documents using XML, the documents themselves will describe their data content.

Of course, the DTD is a rudimentary vehicle for defining data semantics. Standards committees are working on the definition of the XML Schema to replace the DTD for defining XML tags. The XML Schema will allow for more precise definition of data, such as data types, lengths, and scale.

The important thing to remember about XML is that it solves a different problem than HTML. HTML is a markup language, but XML is a meta language. In other words, XML is a language that generates other kinds of languages. The idea is to use XML to generate a language specifically tailored to each requirement you encounter. It is essential to understand this paradigm shift in order to understand the power of XML. (Note: XSL, or eXtensible Stylesheet Language, can be used with XML to format XML data for display.)


In short, XML allows designers to create their own customized tags, thereby enabling the definition, transmission, validation and interpretation of data between applications and between organizations. So the most important reason to learn XML is that it is quickly becoming the de facto standard for application interfaces.
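
As a small illustration of how little a program needs to know in order to consume a self-describing document, the CUSTOMER example above could be read with Java's standard DOM parser along the following lines. This is a sketch only (not part of the original column), assuming the document has been saved to a file named customer.xml:

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class CustomerParser {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();

        // Parse the self-describing document into a DOM tree.
        Document doc = builder.parse(new File("customer.xml"));
        Element customer = doc.getDocumentElement();   // the <CUSTOMER> element

        // Pull values out by tag name -- no column positions or fixed record
        // layout, just the tags defined in the document itself.
        String lastName = customer.getElementsByTagName("last_name")
                                  .item(0).getFirstChild().getNodeValue();
        String city = customer.getElementsByTagName("city")
                              .item(0).getFirstChild().getNodeValue();

        System.out.println(lastName + " lives in " + city);
    }
}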

Some Skepticism

There are, however, some problems with XML. Support for the language, for example, is only partial in the standard, most popular Web browsers. As more XML capabilities gain support and come to market, this will become less of a problem.

Another problem with XML lies largely in market hype. Throughout the industry, there is plenty of confusion surrounding XML. Some believe that XML will provide metadata where none currently exists, or that XML will replace SQL as a data access method for relational data. Neither of these assertions is true. There is no way that any technology, XML included, can conjure up information that does not exist. People must create the metadata tags in XML for the data to be described. XML enables self-describing documents; it doesn’t describe your data for you.

Moreover, XML doesn’t perform the same functions as SQL. As a result, XML can’t replace it. As the standard access method for relational data, SQL is used to "tell" a relational DBMS what data is to be retrieved. XML, on the other hand, is a document description language that describes the contents of data. XML may be useful for defining databases, but not for accessing them.

Integrating XML With the DBMS

More and more of the popular DBMS products are providing support for XML. Take, for example, the XML Extender provided with DB2 UDB Version 7. The XML Extender enables XML documents to be integrated with DB2 databases. By integrating XML into DB2, you can more directly and quickly access the XML documents, as well as search and store entire XML documents using SQL. You also have the option of combining XML documents with traditional data stored in relational tables. When you store or compose a document, you can invoke DBMS functions to trigger an event to automate the interchange of data between applications.

An XML document can be stored complete in a single text column, or XML documents can be broken into component pieces and stored as multiple columns across multiple tables. The XML Extender provides user-defined data types (UDTs) and user-defined functions (UDFs) to store and manipulate XML in the DB2 database. UDTs are defined by the XML Extender for XMLVARCHAR, XMLCLOB, and XMLFILE. Once the XML is stored in the database, the UDFs can be used to search and retrieve the XML data as a complete document or in pieces. The UDFs supplied by the XML Extender include:

storage functions to insert XML documents into a DB2 database

retrieval functions to access XML documents from XML columns

extraction functions to extract and convert the element content or attribute values from an XML document to the data type that is specified by the function name

update functions to modify element contents or attribute values (and to return a copy of an XML document with an updated value)

More and more DBMS products are providing capabilities to store and generate XML. The basic functionality enables XML to be passed back and forth between databases in the DBMS. Refer to Figure 1.

Figure 1. XML and Database Integration

Defining the Future Web

Putting all skepticism and hype aside, XML is definitely the wave of the immediate future. The future of the Web will be defined using XML. The benefits of self-describing documents are just too numerous for XML to be ignored. Furthermore, the allure of using XML to generate an application-specific language is powerful. It is this particular capability that will drive XML to the forefront of computing. More and more organizations are using XML to transfer data, and more capabilities are being added to DBMS products to support XML. Clearly, DBAs will need to understand XML as their companies migrate to the e-business environment. Learning XML today will go a long way toward helping eDBAs be prepared to integrate XML into their data management and application development infrastructure.

For more details and specifics regarding XML, refer to the following website: http://www.w3.org/XML

Please feel free to e-mail me with any burning e-business issues you are experiencing in your shop and I'll try to discuss them in a future column. And please share your successes and failures along the way to becoming an eDBA. By sharing our knowledge, we make our jobs easier and our lives simpler.


Multivalue Database Technology Pros and Cons

CHAPTER 17

MultiValue Lacks Value

With the advent of XML — itself of a hierarchic bent — there is an effort to reposition the old “multivalue” (MV) database technology as “ahead of its time,” and the products based on it (MVDBMS) as undiscovered “diamonds in the rough.” Now, a well-known and often expressed peeve of mine is how widespread the lack of foundation knowledge is in the IT industry. It is the conceptual and logical-physical confusion deriving from that lack of knowledge that produced MV technology in the first place, and is now behind the current attempts at its resurgence. And there hardly is a more confused bunch than the proponents of MV technology. Anything written by them is so incomprehensible and utterly confused that it readily invokes Date’s Incoherence Principle: It is not possible to treat coherently that which is incoherent. What is more, efforts to introduce clarity and precision meet with even more fuzziness and confusion (see "No Value In MultiValue," http://www.dmreview.com/editorial/dmdirect/dmdirect_article.cfm?EdID=5893&issue=101102&record=3). For anybody who believes that anything of value (pun intended) can come from such thinking, I have a bridge in Brooklyn for sale.

Notwithstanding its being called “post-relational,” MV databases and DBMSs originate in the Pick operating system invented decades ago, and are essentially throwbacks to the hierarchic database technology of old. For a feel of how problematic the multivalue thinking — if it can be called that — is, consider quotes from two attempts to explain what MV technology is all about. It is always fascinating, although no longer that surprising, to see how many errors and how much confusion can be packed into short paragraphs. The first quote is the starting paragraph of "The Innovative 3-Dimensional Data Model," an explanation of the Pick model posted on the web site of MV software.

“D3 significantly improves on the relational data structure by providing the ability to store all of the information that would require three separate tables in a relational database, in a single 3-dimensional file. Through the use of variable field and variable record lengths, the D3 database system uses what is called a ‘post-relational’ or ‘three-dimensional’ data model. Using the same example (project reporting by state and fiscal period), a single file can be set up for a project. Values that are specific to each state are grouped logically and stored in the project record itself. In addition, the monthly budget and actual numbers can then be located in the same project definition item. There is no limit to the amount of data that can be stored in a single record using this technology … the same data that requires a multi-table relational database structure can be constructed using a single file in D3.”

Comments:

It is at best misleading, and at worst disingenuous, to claim the MV data structure is an “improvement” on the relational structure. First, the hierarchic structure underlying MV precedes the relational model. And second, the relational model was invented to replace the hierarchic model (which it did), the exact opposite of the claim! Note: In fact, if I recall correctly, the Pick operating system preceded even the first generation hierarchic DBMSs and was only later extended to database management.

The logical-physical confusion raises its ugly head right in the first sentence of the first paragraph. Unlike a MV file, which is physical, relational tables are logical. There is nothing in the relational model — and intentionally so — to dictate how the data in tables should be physically stored and, therefore, nothing to prevent RDBMSs from storing data from multiple logical tables in one physical file. And, in fact, even SQL products — which are far from true implementations of the relational model — support such features. The important difference is that while true RDBMSs (TRDBMS) insulate applications and users from the physical details, MVDBMSs do not.

Paper representations of R-tables are two-dimensional because they are pictures of R-tables, not the real thing. An R-table with N columns is an N-dimensional representation of the real world.

The term “post-relational” — which has yet to be precisely defined — is used in marketing contexts to obscure the non-relational nature of MV products. Neither it, nor the term “three-dimensional,” has anything to do with “variable field” and “variable record length,” implementation features that can be supported by TRDBMSs. That current SQL DBMSs lack such support is not a relational flaw, but a product flaw.

It's the “Values that are specific to each state [that] are grouped logically” that give MV technology its name and throw into serious question whether MV technology adheres to the relational principle of single-valued columns. The purpose of this principle is practical: it avoids serious complications, and takes advantage of the sound foundations of logic and math. This should not be interpreted to mean that "single-valued" means no lists, arrays, and so on. A value can be anything and of arbitrary complexity, but it must be defined as such at the data type (domain) level, and MV products do not do that. In fact, MV files are not relational databases for a variety of reasons, so even if they adhered to the SVC (single-valued columns) principle, it wouldn’t have made a difference (for an explanation why, see the first two papers in the new commercial DATABASE FOUNDATIONS SERIES launched at DATABASE DEBUNKINGS, http://www.dbdebunk.com/).

The second quote is from a response by Steve VanArsdale to my two-part article, "The Dangerous Illusion: Normalization, Performance and Integrity" (in DM Review):


“Multi-value has been called an evolution of the post-relational data base. It is based upon recognition of a simple truth. First considered in the original theories and mathematics surrounding relational data base rules in the 1960’s, multi-value was presumed to be inefficient in the computer systems of the time. A simplifying assumption was made that all data could be normal. Today that is being reconsidered. The simple truth is that real data is not normalized; people have more than one phone number. And they buy more than one item at a time, sometimes with more than one price and quantity. Multi-value is a data base model with a physical layout that allows systematic manipulation and presentation of messy, natural, relational, data in any form, first-normal to fifth-normal. In other words: with repeating groups in a normalized (one-key and one-key-only) table.”

VanArsdale repeats the “post-relational evolution” nonsense. He suffers from the same physical-logical confusion, distorts history to fit his arguments, and displays an utter lack of knowledge and understanding of data fundamentals. Some “simple truth.”

Multivalue was not “first considered in the original theories and mathematics surrounding relational database rules.” The relational model was invented explicitly to replace hierarchic technology, of which multivalue is one version, the latter having nothing to do with mathematics.

VanArsdale has it backwards. It was, in fact, relational technology that was deemed inefficient at its inception by hierarchic proponents, who claimed their approach had better performance. The relational model does indeed have simplifying purposes, but that is an issue separate from efficiency. How can logic, which governs the truth of propositions about the real world, have anything to say about the performance of hardware and software (except, of course, that via data independence, it gives complete freedom to DBMS designers and database implementers to do whatever they darn please at the physical level to maximize performance, as long as they don't expose that level to users)?


How we represent data logically has to do with the kinds of questions we need to ask of databases — data manipulation — and with ensuring correctness (defined as consistency) via DBMS integrity enforcement. We have learned from experience that hierarchic representations complicate manipulation and integrity, a fact completely ignored by MV proponents. What is more, such complications are unnecessary: there is nothing that can be done with hierarchic databases that cannot be achieved with relational databases in a simpler manner. And simplicity means easier and less costly database design and administration, and fewer application development and maintenance efforts.

I have no idea what “Multi-value is a data base model with a physical layout that allows systematic manipulation and presentation of messy, natural, relational, data in any form, first-normal to fifth-normal” means:

o What is a “database model”? Is it anything like a data model? If so, why use a different name?

o What “physical layout” does not allow “systematic manipulation and presentation”? And what does a physical layout have to do with the data model — any data model — employed at the logical level? Is there any impediment to relational databases implementing any physical layout that multi-value databases implement?

o Is “messy natural, relational data” a technical term? Data is not “naturally” relational or hierarchic/multi-value. Any data can be represented either way, and Occam’s Razor says the simplest one should be preferred (which is exactly what Codd’s Information Principle says).

o Every R-table (formally, time-varying relation, or relvar) is in first normal form by definition. But multi-value logical structures are not relations, so does it make sense to speak of normal forms in general, and 1NF in particular in the MV context? (Again, see the FOUNDATION SERIES.)

If I am correct, then how can multi-value proponents claim that their technology is superior to relational technology? Regarding the first quote above:

There is no reference to integrity constraints.

The focus is on one, relatively simple application — “project reporting by state and fiscal period” — for which the hierarchic representation happens to be convenient; no consideration is given to other applications, which it likely complicates.

What happens if and when the structure changes?

It is common for MV proponents to use as examples relatively simple and fixed logical structures, to focus on a certain type of application, and to ignore integrity altogether. Note: This is, in fact, exactly what Oracle did when it added the special CONNECT BY clause to its version of SQL, for explode operations on tree structures. Aside from violating relational closure by producing results with duplicates and meaningful ordering, it works only for very simple trees.

Why don’t MV proponents mention integrity? You can figure that out from another reaction, by Geoff Miller, to my above-mentioned DM Review article:

“The valid criticism of the MV structure is that the flexibility which it provides means that integrity control generally has to be done at the application level rather than the DBMS level — however, in my experience this is easily managed.”[emphasis added]

I would not vouch for flexibility (databases with a hierarchic bent like MVDBMSs are notoriously difficult to change), but be that as it may, anybody with some fundamental knowledge knows that integrity is 70 to 80 percent of database effort. Schema definition and maintenance is, in effect, nothing but the specification and updating of integrity constraints, the sum total of which is a DBMS's understanding of what the database means (the internal predicate; see Practical Issues in Database Management, http://www.dbdebunk.com/books.htm). It follows that in the absence of integrity support, a DBMS does not know what the database means and, therefore, cannot manage it. Products failing to support the integrity function — leaving it to users in applications — are not fully functional DBMSs. That is what we used to have before we had DBMSs: files and application programs.

That MV products do not support a full integrity function is a direct implication of the hierarchic MV structure: data manipulation of hierarchic databases is very complex and, therefore, so is integrity, which is a special application of manipulation. So complex that integrity is not implemented at all, which, by the way, is one reason performance may sometimes be better. In other words, they trade integrity for performance. Chris Date says about hierarchic structures like MV and XML:

“Yet another problem with [hierarchies] is that it’s usually unclear as to why one hierarchy should be chosen over another. For example, why shouldn’t we nest [projects] inside [states], instead of the other way around? Note very carefully too that when the data has a “natural” hierarchic structure as — it might be argued — in the case with (e.g.) departments and employees [projects and states is not that natural], it does not follow that it should be represented hierarchically, because the hierarchic representation isn’t suitable for all of the kinds of processing that might need to be done on the data. To be specific, if we nest employees inside departments, then queries like “Get all employees in the accounting department” might be quite easy, but queries like “Get all departments that employ accountants” might be quite hard.”


So here's what I suggest users insist on, if they want to assess MV products meaningfully. For a real-world database that has a moderately complex schema that sometimes changes, a set of integrity constraints covering the four types of constraints supported by the relational model, and multiple applications accessing data in different ways:

Have MV proponents formulate the constraints in applications and the queries the MV way;

Have relational proponents design the database and formulate the constraints in the database and the queries using truly relational (not SQL!) products such as Alphora’s Dataphor data language, and/or an implementation of Required Technologies’ TransRelational Model™;

Then, judge which approach is superior by comparing them.

To quote:

“I've been teaching myself Dataphor, a product that I learned about through your Web site! As a practice project, I've been rewriting a portion of a large Smalltalk application to use Dataphor, and I've been stunned to see just how much application code disappears when you have a DBMS that supports declarative integrity constraints. In some classes, over 90% of the methods became unnecessary.” —David Hasegawa, "On Declarative Integrity Support and Dataphor"

Wouldn’t Miller say this is easier to manage?

References

"On Multivalue Technology" (http://www.dbdebunk.com/multivalue.htm)

"On Intellectual Capacity in the Industry" (http://www.pgro.uk7.net/intellectual_capacity.htm)

"More on Denormalization, Redundancy and MultiValue DBMSs" (http://www.dbdebunk.com/denorm_0302.htm)

"More on Repeating Groups and Normalization" (http://www.dbdebunk.com/rep_grps_norm_1117.htm)


Securing your Data

CHAPTER 18

Data Security Internals

Back in the days of Oracle7, Oracle security was a relatively trivial matter. Individual access privileges were granted to individual users, and this simple coupling of privileges-to-users comprised the entire security scheme of the Oracle database. However, with Oracle's expansion into enterprise data security, the scope of Oracle security software has broadened. Oracle9i has a wealth of security options, and these options are often bewildering to the IT manager who is charged with ensuring data access integrity. These Oracle tools include role-based security, Virtual Private Databases (VPD) security, and grant execute security:

Role-based security — Specific object-level and system-level privileges are grouped into roles, which can then be granted to individual database users.

Virtual private databases — VPD technology can restrict access to selected rows of tables. Oracle Virtual Private Databases (fine-grained access control) allows for the creation of policies that restrict table and row access at runtime.

Grant-execute security — Execution privileges on procedures can be tightly coupled to users. Users are granted execute privileges on functions and stored procedures. When a user executes the procedures, they gain database access, but only within the scope of the procedure: the grantee takes on the authority of the procedure owner when executing the procedures, but has no access outside the procedure.

Regardless of the tool, it is the job of the Oracle manager to understand these security mechanisms and their appropriate use within an Oracle environment. At this point, it's very important to note that all of the Oracle security tools have significant overlapping functionality. When the security administrator mixes these tools, it is not easy to tell which specific end users have access to what part of the database. For example, an end user who has been granted execution privileges against a stored procedure will have access to certain database entities, but this will not be readily apparent from any specific role-based privileges that user has been granted. Conversely, an individual end user can be granted privileges for a specific database role, but that role can be bypassed by the use of Oracle's Virtual Private Database (VPD) technique.

In sum, each of the three Oracle security methods provides access control to the Oracle database, but they each do it in very different ways. The concurrent use of any of these products can create a nightmarish situation whereby an Oracle security auditor can never know exactly who has access to what specific database entities. Let's begin by reviewing traditional role-based Oracle security.

Traditional Oracle Security

Data-level security is generally implemented by associating a user with a "role" or a "subschema" view of the database. These roles are profiles of acceptable data items and operations, and the role profiles are checked by the database engine at data request time (refer to figure 1).

Oracle's traditional role-based security comes from the standard relational database model. In all relational databases, specific object- and system-level privileges can be created, grouped together into roles, and then assigned to individual users. This method of security worked very well in the 1980s and 1990s, but has some significant shortcomings for individuals charged with managing databases with many tens of thousands of users, and many hundreds of data access requirements.

Figure 1: Traditional relational security.

Without roles, each individual user would need to be granted specific access to every table that they need. To simplify security, Oracle allows for the bundling of object privileges into roles that are created and then associated with users. Below is a simple example:

create role cust_role;
grant select on customer to cust_role;
grant select, update on orders to cust_role;
grant cust_role to scott;

Privileges fall into two categories: system privileges and object privileges. System privileges can be very broad in scope because they grant the right to perform an action, or to perform an action on a particular TYPE of object. For example, "grant select any table to scott" invokes a system-level privilege. Because roles are a collection of privileges, roles can be organized in a hierarchy, and different users can be assigned roles according to their individual needs. New roles can be created from existing roles, from system privileges, from object privileges, or any combination of these (refer to figure 2).


Figure 2: A sample hierarchy for role-based Oracle security.

While this hierarchical model for roles may appear simple, there are some important caveats that must be considered.

Concerns About Role-based Security

There are several areas in which administrators get into trouble. These are granting privileges using the WITH ADMIN option, granting system-level privileges, and granting access to the special PUBLIC user. One confounding feature of role-based security is the cascading ability of GRANT privileges. For example, consider this simple command:

grant select any table to JONES with ADMIN OPTION;


Here we see that the JONES user has been given a system privilege with the ADMIN OPTION, and JONES gains the ability to grant that privilege on to any other Oracle user.

When using grant-based security, there is a method to negate all security for a specific object. Security can be explicitly turned off for an object by using PUBLIC as the receiver of the grant. For example, to turn off all security for the CUSTOMER table, we could enter:

grant select on customer to PUBLIC;

Security is now effectively turned off for the CUSTOMER table, and restrictions may not be added with the REVOKE command. Even worse, all security can be negated with a single command:

grant select any table to PUBLIC;

Closing the Back Doors

As we know, granting access to a table allows the user to access that table anywhere, including from ad-hoc tools such as ODBC, iSQL, and SQL*Plus. Session-level security can be enforced within external Oracle tools as well as within the database. Oracle provides the PRODUCT_USER_PROFILE table to enforce tool-level security, and a user may be disabled from updating in SQL*Plus by making an entry in this table. For example, to disable updates for user JONES, the DBA could state:


INSERT INTO PRODUCT_USER_PROFILE
   (product, userid, attribute, char_value)
VALUES
   ('SQL*Plus', 'JONES', 'UPDATE', 'DISABLED');

User JONES could still perform updates within the application, but would be prohibited from updating while in the SQL*Plus tool. To disable unwanted commands for all end-users, a wildcard can be used in the userid column. To disable the DELETE command for all users of SQL*Plus, you could enter:

INSERT INTO PRODUCT_USER_PROFILE
   (product, userid, attribute, char_value)
VALUES
   ('SQL*Plus', '%', 'DELETE', 'DISABLED');

Unfortunately, while this is great for excluding all users, we cannot alter the tables to allow the DBA staff to have DELETE authority. Next, let's examine an alternative to role-based security, Oracle's Virtual Private Databases.

Oracle Virtual Private Databases

Oracle's latest foray into Oracle security management is a feature with several names. Oracle has two official names for it: virtual private databases, or VPD, which is also known as fine-grained access control. To add to the naming confusion, it is also commonly known as Row Level Security, and the Oracle packages have RLS in the name. Regardless of the naming conventions, VPD security is a very interesting new component of Oracle access controls.


At a high level, VPD security adds a WHERE clause predicate to every SQL statement that is issued on behalf of an individual end user. Depending upon the end user's access, the WHERE clause constrains the information to specific rows within the table, hence the name row-level security.

But we can also do row-level security with views. It is possible to restrict SELECT access to individual rows and columns within a relational table. For example, assume that a person table contains confidential columns such as SALARY. Also assume that this table contains a TYPE column with the values EXEMPT, NON_EXEMPT, and MANAGER. We want our end-users to have access to the person table, but we wish to restrict access to the SALARY column and the MANAGER rows. A relational view could be created to isolate the columns and rows that are allowed:

create view finance_view
as select name, address
   from person
   where department = 'FINANCE';

We may now grant access to this view to anyone:

grant select on FINANCE_VIEW to scott;

Let's take a look at how VPD works. When users access a table (or view) that has a security policy:


1. The Oracle server calls the policy function, which returns a "predicate." A predicate is a WHERE clause that qualifies a particular set of rows within the table. The heart of VPD security is the policy transformation of SQL statements. At runtime, Oracle produces a transient view with the text:

SELECT * FROM scott.emp WHERE P1

2. Oracle then dynamically rewrites the query by appending the predicate to the user's SQL statement.

The VPD methodology is widely used for Oracle systems on the Web, where security must be maintained for each instantiated user, but where, at the same time, data access can be controlled through more procedural methods. Please note that the VPD approach to Oracle security requires the use of PL/SQL functions to define the security logic.

There are several benefits to VPD security:

Multiple security — You can place more than one policy on each object, as well as stack highly specific policies upon other base policies.

Good for Web Apps — In Web applications, a single user often connects to the database. Hence, row-level security can easily differentiate between users.

No back-doors — Users no longer bypass security policies embedded in applications, because the security policy is attached to the data.

To understand how VPD works, let's take a closer look at the emp_sec function below. Here we see that the emp_sec function returns a SQL predicate, in this case "ename=xxxx", in which xxxx is the current user (in Oracle, we can get the current user ID by calling the sys_context function). This predicate is appended to the WHERE clause of every SQL statement issued by the user when they reference the EMP table.

CREATE OR REPLACE FUNCTION emp_sec (schema IN varchar2, tab IN varchar2)
RETURN VARCHAR2
AS
BEGIN
   RETURN 'ename=''' || sys_context('userenv', 'session_user') || '''';
END emp_sec;
/

Once the function is created, we call the dbms_rls (row-level security) package. To create a VPD policy, we invoke the add_policy procedure, and figure 3 shows an example of the invocation of the add_policy procedure. Take a close look at this policy definition:

Figure 3: Invoking the add_policy Procedure.

In this example, the policy dictates that:


Whenever the EMP table is referenced
In a SELECT query
A policy called EMP_POLICY will be invoked
Using the SECUSR PL/SQL function.

Internally, Oracle treats the EMP table as a view and does the view expansion just like an ordinary view, except that the view text is taken from the transient view instead of the data dictionary. If the predicate contains subqueries, then the owner (definer) of the policy function is used to resolve objects within the subqueries and to check security for those objects. In other words, users who have access privileges to the policy-protected objects do not need to know anything about the policy. They do not need to be granted object privileges for any underlying security policy. Furthermore, the users also do not require EXECUTE privileges on the policy function, because the server makes the call with the function definer's rights.

In figure 4 we see the VPD policy in action. Depending on who is connected to the database, different row data is displayed from identical SQL statements. Internally, Oracle is rewriting the SQL inside the library cache, appending the WHERE clause to each SQL statement.


Figure 4: The VPD Policy in Action.

While the VPD approach to Oracle security works great, there are some important considerations. The foremost benefit of VPD is that the database server automatically enforces these security policies, regardless of how the data is accessed, through the use of variables that are dynamically defined within the database user's session. The downsides to VPD security are that VPD security policies are required for every table accessed inside the schema, and the user still must have access to the table via traditional GRANT statements. Next, let's examine a third type of Oracle security, the grant execute method.


Procedure Execution Security

Now we visit the third main area of Oracle security, the ability to grant execution privileges on specific database procedures. Under the grant execute model, an individual needs nothing more than connect privileges to attach to the Oracle database. Once attached, execution privileges on any given stored procedure, package, or function can be directly granted to each end user. At runtime, the end-user is able to execute the stored procedure, taking on the privileges of the owner of the stored procedure.

As we know, one shortcoming of traditional role-based security is that end users can bypass their application screens and access their Oracle databases through SQL*Plus or iSQL. One benefit of the grant execute model is that you ensure that your end users are only able to use their privileges within the scope of your predefined PL/SQL or Java code. In many cases, the grant execute method provides tighter access control because it controls not only the database entities that a person is able to see, but also what they are able to do with those entities.

The grant execute security model fits in very nicely with the logic consolidation trend of the past decade. By moving all of the business logic into the database management system, it can be tightly coupled to the database and at the same time have the benefit of additional security. The Oracle9i database is now the repository not only for the data itself, but for all of the SQL and stored procedures and functions that transform the data. By consolidating both the data and the procedures in a central repository, the Oracle security manager has much tighter control over the entire database enterprise.
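
From the application's point of view, the model looks something like the following hedged sketch. The schema, procedure name, and parameter values are invented for illustration; the point is that the connecting user holds only CONNECT plus EXECUTE on the procedure, and never touches the underlying tables directly:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class GrantExecuteExample {
    public static void main(String[] args) throws Exception {
        Class.forName("oracle.jdbc.driver.OracleDriver");

        // "clerk" has CONNECT plus EXECUTE on app_owner.update_salary only;
        // it holds no SELECT, INSERT, UPDATE, or DELETE grants on any table.
        Connection con = DriverManager.getConnection(
            "jdbc:oracle:thin:@dbhost:1521:PROD", "clerk", "secret");

        // All table access occurs inside the procedure, under the owner's authority.
        CallableStatement cs = con.prepareCall(
            "{call app_owner.update_salary(?, ?)}");
        cs.setInt(1, 7839);        // employee number (illustrative)
        cs.setDouble(2, 5500.00);  // new salary (illustrative)
        cs.execute();

        cs.close();
        con.close();
    }
}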


There are many compelling benefits to putting all Oracle SQL inside stored procedures, including:

Better performance — Stored procedures load once into the shared pool and remain there unless they become paged out. The stored procedures can be bundled into packages, which can then be pinned inside the Oracle SGA for super-fast performance. At the PL/SQL level, the stored procedures can be compiled into C executable code where they run very fast compared to external business logic.

Coupling of data with behavior — Developers can use Oracle member methods to couple Oracle tables with the behaviors that are directly associated with each table. This coupling provides modular, object-oriented code.

Improved security — By coupling PL/SQL member methods and stored procedures with grant execute access, the manager gains complete access control, both over the data that is accessed and how the data is transformed.

Isolation of code — Since all SQL is moved out of the external programs and into stored procedures, the application programs become nothing more than calls to generic stored procedures. As such, the database layer becomes independent from the application layer.

Grant execute security can give much tighter control over security than data-specific security. The DBA can authorize the application owners with the proper privileges to perform their functions, and the end-users will not have any explicit GRANTs against the database. Instead, they are granted EXECUTE on the procedure, and the only way that the user will be able to access the data is through the procedure. Remember, the owner of the procedure governs the access rights to the data. There is no need to create huge GRANT scripts for each and every end-user, and there is no possibility of end users doing an "end-run" and accessing the tables from within other packages.

The grant execute access method has its greatest benefit in the coupling of data access security and procedural security. When an individual end-user is granted execute privileges against a stored procedure or package, the end user may use those packages only within the context of the application itself. This has the side benefit of enforcing not only table-level security, but column-level security. Inside the PL/SQL package, we can specify individual WHERE predicates based on the user ID and very tightly control access to virtually any distinct data item within our Oracle database.

The confounding problem with procedures and packages is that their security is managed in an entirely different fashion from other GRANT statements. When a user is given execution privileges on a package, they will be operating under the security domain of the owner of the procedure, and not their own defined security domain. In other words, a user who does not have privileges to update employee rows can get this privilege by being authorized to use a procedure that updates employees. From the DBA's perspective, database security audits cannot easily reveal this update capability.

Conclusion

By itself, each Oracle security mechanism does an excellent job of controlling access to data. However, it can be quite dangerous (especially from an auditing perspective) to mix and manage the three security modes together. For example, an Oracle shop using role-based security that also decided to use virtual private databases would have a hard time reconciling which users had specific access to which data tables and rows.

Another example would be mixing grant execute security with either role-based or VPD security. Grant execute security takes its specific privileges from the owner of the procedure, such that each user who has been granted access to a stored procedure may (or may not) be seeing all of the database entities that are allowed to the owner of the procedure. In other words, only a careful review of the actual PL/SQL or Java code will tell us exactly what a user is allowed to view inside the database.

As Oracle security continues to evolve, we will no doubt see more technical advances in data control methods. For now, it is the job of the Oracle DBA to ensure that all access to data is tightly controlled and managed.


Maintaining Efficiency

CHAPTER 19

eDBA: Online Database Reorganization

The beauty of relational databases is the way they make it easy for us to access and change data. Just issue some simple SQL -- select, insert, update, or delete -- and the DBMS takes care of the actual data navigation and modification. To provide this level of abstraction, a lot of complexity is built into the DBMS; it must provide in-depth optimization routines, leverage powerful performance-enhancing techniques, and handle the physical placement and movement of data on disk.

Theoretically, this makes everyone happy. The programmer's interface is simplified and the DBMS takes care of the hard part -- manipulating the data and coordinating its actual storage. But in reality, things are not quite that simple. The way the DBMS physically manages data can cause performance problems. Every DBA has experienced a situation in which an application slows down after it has been in production for a while. But why this happens is not always evident. Perhaps the number of transactions issued has increased, or maybe the volume of data has increased. But for some problems, these factors alone will not explain a large performance degradation. In fact, the problem might be disorganized data in the database.

Database disorganization occurs when a database's logical and physical storage allocations contain many scattered areas of storage that are too small, not physically contiguous, or too disorganized to be used productively.


To understand how performance can be impacted by database disorganization, let's examine a "sample" database as modifications are made to data. Assume that a tablespace exists that consists of three tables across multiple blocks. As we begin our experiment, each table is contained in contiguous blocks on disk as shown in Figure 1. No table shares a block with any other. Of course, the actual operational specifics will depend on the DBMS being used as well as the type of tablespace, but the scenario is generally applicable to any database at a high level -- the only difference will be in terminology (for example, Oracle block versus DB2 page).

Figure 1: An organized tablespace containing three tables

Now let's make some changes to the tables in this tablespace. First, let's add six rows to the second table. But no free space exists into which these new rows can be stored. How can the rows be added? The DBMS takes another extent into which the new rows can be placed.

For the second change, let's update a row in the first table to change a variable character column; for example, let's change the LASTNAME column from "DOE" to "BEAUCHAMP." This update results in an expanded row size because the value for LASTNAME is longer in the new row: "BEAUCHAMP" consists of 9 characters whereas "DOE" only consists of 3.

Let's make a third change, this time to table three. In this case we are modifying the value of every clustering column such that the DBMS cannot maintain the data in clustering sequence.

After these changes the resultant tablespace most likely will be disorganized (refer to Figure 2). The type of data changes that were made can result in fragmentation, row chaining, and declustering.


Figure 2: The same tablespace, now disorganized

Fragmentation is a condition in which there are many scattered areas of storage in a database that are too small to be used productively. It results in wasted space, which can hinder performance. When updated data does not fit in the space it currently occupies, the DBMS must find space for the row using techniques like row chaining and row migration. With row chaining, the DBMS moves a part of the new, larger row to a location within the tablespace where free space exists. With row migrations the full row is placed elsewhere in the segment. In each case a block-resident pointer is used to locate either the rest of the row or the full row. Both row chaining and row migration will result in multiple I/Os being issued to read a single row. This will cause performance to suffer because multiple I/Os are more expensive than a single I/O.


Finally, declustering occurs when there is no room to maintain the order of the data on disk. When clustering is used, a clustering key composed of one or more columns is specified. When data is inserted into the table, the DBMS attempts to insert the data in sequence by the values of the clustering key. If no room is available, the DBMS will insert the data where it can find room. Of course, this declusters the data, and that can significantly impact the performance of sequential I/O operations.

Reorganizing Tablespaces

To minimize fragmentation and row chaining, as well as to re-establish clustering, database objects need to be restructured on a regular basis. This process is known as reorganization. The primary benefit is the resulting speed and efficiency of database functions because the data is organized in a more optimal fashion on disk. The net result of reorganization is to make Figure 2 look like Figure 1 again. In short, reorganization is useful for any database because data inevitably becomes disorganized as it is used and modified.

DBAs can reorganize "manually" by completely rebuilding databases. But a manual reorganization requires a complex series of steps, for example:

Backup the database
Export the data
Delete the database object(s)
Re-create the database object(s)
Sort the exported data (by the clustering key)
Import the data

Reorganization usually requires the database to be down. The high cost of downtime creates pressures both to perform and to delay preventive maintenance -- a familiar quandary for DBAs. Third-party tools are available that automate the manual process of reorganizing tables, indexes, and entire tablespaces -- eliminating the need for time- and resource-consuming database rebuilds. In addition to automation, this type of tool typically can analyze whether reorganization is needed at all. Furthermore, ISV reorg tools operate at very high speeds to reduce the duration of outages.

Online Reorganization

Modern reorganization tools enable database structures to be reorganized while the data is up and available. To accomplish an online reorganization, the database structures to be reorganized must be copied. Then this "shadow" copy is reorganized. When the shadow reorganization is complete, the reorg tool "catches up" by reading the log to apply any changes that were made during the online reorganization process.

Some vendors offer leading-edge technology that enables the reorg to catch up without having to read the log. This is accomplished by caching data modifications as they are made. The reorg can read the cached information much quicker than trying to catch up by reading the log. Sometimes the reorganization process requires the DBA to create special tables to track and map internal identifiers and pointers as they are changed by the reorg. More sophisticated solutions keep track of such changes internally without requiring these mapping tables to be created.


Running reorganization and maintenance tasks while the database is online enhances availability -- which is the number one goal of the eDBA. The more availability that can be achieved for databases that are hooked up to the Internet, the better the service that your online customers will receive. And that is the name of the game for the web-enabled business.

When evaluating the online capabilities of a reorganization utility, the standard benchmarking goals are not useful. For example, the speed of the utility is not as important because the database remains online while the reorg executes. Instead, the more interesting benchmark is what else can run at the same time. The online reorg should be tested against multiple different types of concurrent workload -- including heavy update jobs where the modifications are both sequential and random. The true benefit of the online reorg should be based on how much concurrent activity can run while the reorg is running -- and still result in a completely reorganized database. Some online reorg products will struggle to operate as the concurrent workload increases -- sometimes requiring the reorg to be cancelled.

Synopsis

Reorganizations can be costly in terms of downtime and computing resources. And it can be difficult to determine when reorganization will actually create performance gains. However, the performance gains that can be accrued are tremendous when fragmentation and disorganization exist. The wise DBA will plan for regular database reorganization based on an examination of the data to determine if the above types of disorganization exist within their corporate databases.


Moreover, if your company relies on databases to service its Web-based customers, you should purchase the most advanced online reorganization tools available because every minute of downtime translates into lost business. An online reorganization product can pay for itself very quickly if you can keep your web-based applications up and running instead of bringing them down every time you need to run a database reorg.


The Highly Available Database

CHAPTER 20

The eDBA and Data Availability

Greetings and welcome to a new monthly column that explores the skills required of DBAs as their companies move from traditional business models to become e-businesses. This, of course, begs the question: what is meant by the term e-business? There is a lot of marketing noise surrounding e-business, and sometimes the messages are confusing and disorienting. Basically, e-business can be thought of as the transformation of key business processes through the use of Internet technologies.

Internet usage, specifically web usage, is increasing at a rapid pace and infiltrating most aspects of our lives. Web addresses are regularly displayed on television commercials, many of us buy books, CDs, and even groceries on-line instead of going to traditional "bricks and mortar" stores, and the businesses where we work are conducting web-based transactions with both their customers and suppliers. Indeed, Internet technologies are pervasive and the Internet is significantly changing the way we do business.

This column will discuss how the transformation of businesses to e-businesses impacts the disciplines of data management and database administration. Please feel free to e-mail me with any burning issues you are experiencing in your shop and to share both successes and failures along the way to becoming an eDBA, that is, a DBA who manages the data of an e-business.


The First Important Issue is Availability

Because an e-business is an online business, it can never close. There is no such thing as a batch window for an e-business application. Customers expect full functionality on the Web regardless of the time of day. And remember, the Web is worldwide: when it is midnight in Chicago it is 3:00 PM in Sydney, Australia. An e-business must be available and operational 24 hours a day, 7 days a week, 366 days a year (do not forget leap years). It must be prepared to engage with customers at any time or risk losing business to a company whose Web site is more accessible.

Some studies show that if a web user clicks his mouse and does not receive a transmission back to his browser within seven seconds, he will abandon that request and go somewhere else. On the web, your competitor is just a simple mouse click away. The net result is that e-businesses are more connected, and therefore must be more available in order to be useful.

So as e-businesses integrate their Web presence with traditional IT services such as database management systems, this integration creates heightened expectations for data availability. And the DBA will be charged with maintaining that high level of availability. In fact, BMC Software has coined a word to express the increased availability requirements of web-enabled databases: e-vailability.

What is Implied by e-vailability?

The term e-vailability describes the level of availability necessary to keep an e-business continuously operational. Downtime and outages are the enemy of e-vailability. There are two general causes of application downtime: planned outages and unplanned outages.


Historically, unplanned outages comprised the bulk of application downtime. These outages were the result of disasters, operating system crashes, and hardware failures. However, this is simply not the case any more. In fact, today most outages are planned outages, caused by the need to apply system maintenance or make changes to the application, database, or software components. Refer to Figure 1. Fully 70 per cent of application downtime is caused by planned outages to the system. Only 30 per cent is due to unplanned outages.

Figure 1: Downtime Versus Availability

Industry analysts at the Gartner Group estimate that as much as 80% of application downtime is due to application software failures and human error (see Figure 2). Hardware failures and operating system crashes were common several years ago, but today's operating systems are quite reliable, with a high mean time between failures. What does all of this mean for the eDBA? Well, the first thing to take away from this discussion is: "Although it is important to plan for recovery from unplanned outages, it is even more important to minimize downtime resulting from planned outages. This is true because planned outages occur more frequently and therefore can have a greater impact on e-vailability than unplanned outages."


How can an eDBA reduce downtime associated with planned outages? The best way to reduce downtime is to avoid it. Consider the following technology and software to avoid the downtime traditionally associated with planned outages.

Figure 2. Causes of Unplanned Application Downtime (source: Gartner Group)

Whenever possible, avoid downtime altogether by managing databases while they are online. One example is concurrent database reorganization. Although traditional reorganization scripts and utilities require the database objects to be offline (which results in downtime), new and more efficient reorganization utilities are available that can reorg data to a mirror copy and then swap the copies when the reorg process is complete. If the database can stay online during the reorg process, downtime is eliminated. These techniques require significantly more disk space, but will not disrupt an online business.

Another example of online database administration is tweaking system parameters. Every DBMS product provides system parameters that control the functionality and operation of the DBMS -- for example, the DSNZPARMs in DB2 for OS/390 or the init.ora parms in Oracle. Often it is necessary to bring


the DBMS down and restart it to make changes to these parameters. In an e-business environment this downtime can be unacceptable. There are products on the market that enable DBMS system parameters to be modified without recycling the DBMS address spaces. Depending upon the impact to the e-business applications, the affected system parameters, and the severity of the problem, a single instance where the system parameters can be changed without involving an outage can cost-justify the investment in this type of management tool.

Sometimes downtime cannot be avoided. If this is the case, you should strive to minimize downtime by performing tasks faster. Be sure that you are using the fastest and least error-prone technology and methods available to you. For example, if a third-party RECOVER, LOAD, or REORG utility can run in one half to one quarter of the time of a traditional database utility, consider migrating to the faster technology. In many cases the faster technology will pay for itself much quicker in an e-business because of the increased availability requirements.

Another way to minimize downtime is to automate routine maintenance tasks. For example, changing the structure of a table can be a difficult task. The structure of relational databases can be modified using the ALTER statement, but ALTER is a functionally crippled statement: it cannot alter all of the parameters that can be specified for an object when it is created. Most RDBMS products enable you to add columns to an existing table, but only at the end; furthermore, you cannot remove columns from a table. The table must be dropped, then re-created without the columns targeted for removal.

Another problem that DBAs encounter in modifying relational structures is the cascading drop effect. If a change to a database object mandates it being


dropped and re-created, all dependent objects are dropped when the database object is dropped. This includes tables, all indexes on the tables, all primary and foreign keys, any related synonyms and views, any triggers, and all authorization. Tools are available that allow you to make any desired change to a relational database using a simple online interface. By pointing, clicking, and selecting using the tool, scripts are generated that understand the correct way to make changes to the database. When errors are avoided using automation, downtime is diminished, resulting in greater e-vailability.
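The drop-and-recreate sequence that such a change typically expands into looks roughly like the following sketch. It is written in generic Oracle-style SQL against a hypothetical EMPLOYEE table from which an OBSOLETE_COL column is being removed; the index, constraint, and grant names are assumptions, and the exact dependent objects to rebuild will differ in every shop -- which is precisely the bookkeeping a change-management tool automates.

    -- 1. Preserve the data, leaving out the column to be removed
    CREATE TABLE employee_new AS
      SELECT emp_id, last_name, first_name, dept_id   -- obsolete_col omitted
        FROM employee;

    -- 2. Drop the old table; this also drops its indexes, constraints,
    --    and triggers, and invalidates dependent views and privileges
    DROP TABLE employee;

    -- 3. Rename the copy and rebuild everything that was dropped
    RENAME employee_new TO employee;
    ALTER TABLE employee ADD CONSTRAINT employee_pk PRIMARY KEY (emp_id);
    CREATE INDEX employee_dept_ix ON employee (dept_id);
    GRANT SELECT ON employee TO reporting_role;
    -- ...plus any foreign keys, triggers, views, and synonyms that referenced it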

The Impact of Downtime on an e-business

Downtime is truly the insidious villain out to ruin e-businesses. To understand just how damaging downtime can be to an e-business, consider the series of outages suffered by eBay in 1999. As the leading auction site on the Internet, eBay's customers are both the sellers and buyers of items put up for bid on its Web site. The company's business model relies on the Web as a mechanism for putting buyers in touch with sellers. If buyers cannot view the items up for sale, the model ceases to work. From December 1998 to June 1999 the eBay web site was inaccessible for at least 57 hours, caused by the following:

December 7 - Storage software fails (14 hours)
December 18 - Database server fails (3 hours)
March 15 - Power outage shuts down ISP
May 20 - CGI server fails (7 hours)
May 30 - Database server fails (3 hours)
June 9 - New UI goes live; database server fails (6 hours)
June 10 - Database server fails (22 hours)
June 12 - New UI and personalization killed
June 13-15 - Site taken offline for maintenance (2 hours)

These problems resulted in negative publicity and lost business. Some of these problems required data to be restored. eBay customers could not reliably access the site for several days. Auction timeframes had to be extended, and bids that might have been placed during that timeframe were lost. eBay agreed to refund all fees for all auctions on its site during the time when its systems were down. Recovering from this series of outages impacted eBay's profits by an estimated $5 million in refunds and auction extensions. This, in turn, caused the stock to drop from a high of $234 in April to the $130 range in mid-July.

Don't misunderstand and judge eBay too harshly, though. eBay is a great site, a good business model, and a fine example of an e-business. But better planning and preparation for "e-database administration" could have reduced the number of problems it encountered.

Conclusion

These are just a few techniques eDBAs can use to maintain high e-vailability for their web-enabled applications. Read this column every month for more tips, tricks, and techniques on achieving e-vailability and migrating your DBA skills to the web.


Chapter 21 - The eDBA and Recovery

eDatabase Recovery Strategy

As I have discussed in this column before, availability is the most important issue faced by eDBAs in managing the database environment for an e-business. An e-business, by definition, is an online business - and an online business should never close. Customers expect Web applications to deliver full functionality regardless of the day of the week or the time of day. And never forget that the Web is worldwide - when it is midnight in New York it is still prime time in Singapore. Simply put, an e-business must be available and operational 24 hours a day, 365 days a year. An e-business must be prepared to engage with customers at any time or risk losing business to a company whose website is more accessible. Studies show that if a Web user clicks on a link and doesn't receive a transmission back to his browser within seven seconds, he will go somewhere else. Chances are that customer will never come back if his needs were satisfied elsewhere. Outages result in lost business, and lost business can spell doom for an e-business.

Nevertheless, problems will happen, and problems can cause outages. You can plan for many contingencies, and indeed you should plan for as many as are fiscally reasonable. But regardless of the amount of upfront planning, eventually problems will occur. And when problems impact data, databases will need to be recovered. Therefore, the eDBA must be prepared to resolve data problems by implementing a sound strategy for database recoveries. But this is good advice for all DBAs, not just eDBAs. The eDBA must take database recovery planning to a higher level - a level that anticipates failure with a plan to reduce (perhaps eliminate) downtime during recovery. The truth of the matter is that an outage-less recovery is usually not possible in most shops today. Sometimes this is the fault of technology and software deficiencies. However, in many cases, technology exists that can reduce downtime during a database recovery, but is not implemented due to budget issues or lack of awareness on the part of the eDBA.

eDatabase Recovery Strategies

A database recovery strategy must plan for all types of database recovery because problems can impact data at many levels and in many ways. Depending upon the nature of the problem and its severity, integrity problems can occur at any place within the database. Several rows, or perhaps only certain columns within those rows, may be corrupted. This type of problem is usually caused by an application error. An error can occur that impacts an entire database object such as a table, data space, or table space becoming corrupted. This type of problem is likely to be caused by an application error or bug, a DBMS bug, an operating system error, or a problem with the actual file used by the database object. More severe errors can impact multiple database objects, or even worse, an entire database. A large program glitch, hardware problem or DBMS bug can cause integrity problems for an entire database, or depending on the scale of the system, multiple databases may be impacted.

Sometimes small data integrity problems can be more difficult to eradicate than more massive problems. For example, if only a small percentage of columns of a specific table are impacted it may take several days to realize that the data is in error. However, problems that impact a larger percentage of data are likely to be identified much earlier. In general, the earlier an error is found, the more recovery options available to the eDBA and the easier it is to correct the data. This is true because transactions performed subsequent to the problem may have changed other areas of the database, and may even have changed other data based on the incorrect values.

Recovery-To-Current

A useful database recovery strategy must plan for many different types of recovery. The first type of recovery that usually comes to mind is a recovery-to-current to handle some sort of disaster. This disaster could be anything from a simple media failure to a natural disaster destroying your data center. Applications may be completely unavailable until the recovery is complete. These days, outages due to simple media failures can often be avoided by implementing modern disk technologies such as RAID. RAID, an acronym for Redundant Arrays of Inexpensive Disks, is a technology that combines multiple disk devices into a single array that is perceived by the system as a single disk drive. There are many levels of RAID technology and, depending on the level in use, different degrees of fault-tolerance that are supported. For more details on RAID, please see the accompanying sidebar. Another desirable aspect of RAID arrays is the ability to use hot swappable drives so the array does not have to be powered down to replace a failed drive. Instead, a drive can be replaced while the array is up and running - and that is a good thing for eDBAs because it enhances overall data availability.

Sidebar: RAID Levels

There are several levels of RAID that can be implemented. RAID Level 0 (or RAID-0) is also commonly referred to as disk striping. With RAID-0, data is split across multiple drives, which delivers higher data throughput. But there is no redundancy (which really doesn't fit the definition of the RAID acronym). Because there is no redundant data being stored, performance is usually very good, but a failure of any disk in the array will result in data loss.

RAID-1, sometimes referred to as data mirroring, provides redundancy because all data is written to two or more drives. A RAID-1 array will generally perform better when reading data and worse when writing data (as compared to a single drive). However, RAID-1 provides data redundancy so if any drive fails, no data will be lost. RAID-2 provides error correction coding. RAID-2 would be useful only for drives without any built-in error detection. RAID-3 stripes data at a byte level across several drives, with parity stored on one drive. RAID-3 provides very good data transfer rates for both reads and writes.

RAID-4 stripes data at a block level across several drives, with parity stored on a single drive. For RAID-3 and RAID-4, the parity information allows recovery from the failure of any single drive. The performance of writes can be slow with RAID-4, and it can be quite difficult to rebuild data in the event of a RAID-4 disk failure.

RAID-5 is similar to RAID-4, but it distributes the parity information among the drives. RAID-5 can outperform RAID-4 for small writes in multiprocessing systems because the parity disk does not become a bottleneck. But read performance can suffer because the parity information is on several disks.

RAID-6 is basically an extension of RAID-5, but it provides additional fault tolerance through the use of a second independent distributed parity scheme. Write performance of RAID-6 can be poor.

RAID-10 is a striped array where each segment is a RAID-1 array. Therefore, it provides the same fault tolerance as RAID-1. A high degree of performance and reliability can be delivered by RAID-10, so it is very suitable for high performance database processing. However, RAID-10 can be very expensive. RAID-53 is a striped array where each segment is a RAID-3 array. Therefore, RAID-53 has the same fault tolerance and overhead as RAID-3. Finally, RAID-0+1 combines the mirroring of RAID-1 with the striping of RAID-0. This couples the high performance of RAID-0 with the reliability of RAID-1.

In some cases storage vendors come up with their own variants of RAID. Indeed, there are a number of proprietary variants and levels of RAID defined by the storage vendors. If you are in the market for RAID storage, be sure you understand exactly what the storage vendor is delivering. For more details, check out the detailed information at RAID.edu.

A disaster that takes out your data center is the worst of all possible situations and will definitely result in an outage of some considerable length. The length of the outage will depend greatly on the processes in place to send database copies and database logs to an off-site location. Overall downtime for a disaster also depends a good deal on how comprehensive and automated your recovery procedures are at the remote site. The eDBA should be prepared with automated procedures for handling a disaster. But simple automation is insufficient. The eDBA must ensure the consistent backup and offsite routing of not just all of the required data, but also the IT infrastructure resources required to bring up the organization's databases at the remote site. This is a significant task that requires planning, periodic testing and vigilance. The better the plan, the shorter the outage and the smaller the impact will be on the e-business. Consider purchasing and deploying DBA tools that automate backup and recovery processes to shorten the duration of a disaster recovery scenario.
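As a point of reference, even without a third-party tool the image copy step itself can often be scripted against an open database. The following is a minimal sketch of a user-managed hot backup in Oracle; the USERS tablespace is an illustrative assumption, the actual datafile copy happens outside of SQL, and a packaged backup tool would normally wrap and verify these steps.

    -- Place the tablespace in backup mode so its datafiles can be copied
    -- while the database remains open for updates
    ALTER TABLESPACE users BEGIN BACKUP;

    -- (copy the underlying datafiles at the operating system or
    --  storage level while the tablespace is in backup mode)

    ALTER TABLESPACE users END BACKUP;

    -- Force the current log to be archived so the copy can be rolled
    -- forward to a consistent point during a recovery
    ALTER SYSTEM ARCHIVE LOG CURRENT;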


Of course, other considerations are involved if your entire data center has been destroyed. The resumption of business will involve much more than being able to re-deploy your databases and get your applications back online. But those topics are outside the scope of this particular article.

Point-in-Time Recovery

Another type of database recovery is a Point-in-Time (PIT) recovery. PIT recovery usually is performed to deal with application level problems. Conventional techniques to perform a point-in-time recovery will remove the effects of all transactions performed since a specified point in time. The traditional approach will involve an outage. Steps for PIT recovery include:

1. Identifying the point in time to which the database should be recovered. Depending on the DBMS being used, this can be to an actual time, an offset on the database log, or to a specific image copy backup (or set of backups). Care must be taken to ensure that the PIT selected for recovery will provide data integrity, not just for the database object impacted, but for all related database objects as well.

2. The database objects must be taken off-line while the recovery process applies the image copy backups.

3. If the recovery is to a PIT later than the time the backup was taken, the DBMS must roll forward through the database logs applying the changes to the database objects.

4. When complete, the database objects can be brought back online.

The outage will last as long as it takes to complete steps 2 through 4. Depending on the circumstances, you might want to make the database objects unavailable for update immediately upon discovering data integrity problems so that subsequent activities do not make the situation worse. In that case, the outage will encompass Steps 1 through 4. Further problems can ensue if there were some valid transactions after the PIT selected that still need to be applied. In that case, an additional step (say, Step 5) should be added to re-run appropriate transactions. That is, if the transactions can even be identified and re-running is a valid option.

Overall, the quicker this entire process can be accomplished, the shorter the outage. Step 1 can take a lot of time, and the more it can be automated the better. Tools exist which make it easier to interpret database logs and identify an effective PIT for recovery. For the e-business, this type of tool can pay for itself after a single usage if it significantly reduces an outage and enables the e-business application to come back online quickly.
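In Oracle environments, for example, the LogMiner packages that ship with the DBMS can serve this purpose. The sketch below assumes the online catalog is used as the LogMiner dictionary; the archived log name, table name, and user name are illustrative, and a commercial log analysis tool would hide these steps behind a friendlier interface.

    -- Register a log file covering the suspect timeframe and start LogMiner
    BEGIN
      DBMS_LOGMNR.ADD_LOGFILE('/u01/oradata/arch/arch_1234.log', DBMS_LOGMNR.NEW);
      DBMS_LOGMNR.START_LOGMNR(options => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
    END;
    /

    -- Locate the first change made by the suspect job; everything before
    -- that SCN or timestamp is a candidate point in time for the recovery
    SELECT timestamp, scn, username, operation, sql_redo, sql_undo
      FROM v$logmnr_contents
     WHERE seg_name = 'ORDERS'       -- hypothetical impacted table
       AND username = 'BATCH_USER'   -- hypothetical suspect user
     ORDER BY scn;

    BEGIN
      DBMS_LOGMNR.END_LOGMNR;
    END;
    /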

Transaction Recovery

A third type of database recovery exists for e-businesses willing to invest in sophisticated third-party recovery solutions. Transaction Recovery addresses the shortcomings of traditional recoveries by reducing or eliminating downtime and avoiding the loss of good data. Simply stated, Transaction Recovery is the process of removing the undesired effects of specific transactions from the database. This statement, while simple on the surface, hides a bevy of complicated details. Let's examine the details behind the concept of Transaction Recovery.

Traditional recovery is at the database object level: for example, at the data space, table space or index level. When performing a traditional recovery, a specific database object is chosen. Then, a backup copy of that object is applied, followed by reapplying log entries for changes that occurred after the image copy was taken. This approach is used to recover the database object to a specific, desired point in time. If multiple objects must be recovered, this approach is repeated for each database object impacted.

Transaction recovery uses the database log instead of image copy backups. Remember that all changes made to a relational database are captured in the database log. So, if the change details can be read from the log, recovery can be achieved by reversing the impact of the logged changes. Log-based transaction recovery can take two forms: UNDO recovery or REDO recovery. For UNDO recovery, the database log is read to find the data modifications that were applied during a given timeframe and:

INSERTs are turned into DELETEs
DELETEs are turned into INSERTs
UPDATEs are turned around to UPDATE to the old value
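For illustration only, suppose the log shows that a runaway batch job made the following changes; the table, columns, and values are hypothetical, and the UNDO statements are the kind a log-based recovery tool would generate to invert them:

    -- Logged change:  INSERT INTO orders (order_id, amount) VALUES (901, 50);
    -- Generated UNDO:
    DELETE FROM orders WHERE order_id = 901;

    -- Logged change:  UPDATE orders SET amount = 99 WHERE order_id = 902;  (old value: 25)
    -- Generated UNDO:
    UPDATE orders SET amount = 25 WHERE order_id = 902;

    -- Logged change:  DELETE FROM orders WHERE order_id = 903;  (deleted row: 903, 75)
    -- Generated UNDO:
    INSERT INTO orders (order_id, amount) VALUES (903, 75);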

In effect, an UNDO recovery reverses database modifications using SQL. The traditional DBMS products do not provide native support for this. To generate UNDO recovery SQL, you will need a third-party solution that understands the database log format and can create the SQL needed to undo the data modifications. An eDBA should note that in the case of UNDO Transaction Recovery, the portion of the database that does not need to be recovered remains undisturbed. When undoing erroneous transactions, recovery can be done online without suffering an outage of the application or the database. UNDO Transaction Recovery is basically an online database recovery. Of course, whether or not it is desirable to keep the database online during a Transaction Recovery will depend on the nature and severity of the database problem.

The second type of Transaction Recovery is REDO Transaction Recovery. This strategy is a combination of PIT recovery and UNDO Transaction Recovery with a twist. Instead of generating SQL for the bad transactions that we want to eliminate, we generate the SQL for the transactions we want to save. Then we do a standard PIT recovery, eliminating all the transactions since the recovery point. Finally, we reapply the good transactions captured in the first step. Unlike the UNDO process, which creates SQL statements that are designed to back out all of the problem transactions, the REDO process re-creates SQL statements that are designed to reapply only the valid transactions from a consistent point of recovery to the current time. Since the REDO process does not generate SQL for the problem transactions, performing a recovery and then executing the REDO SQL can restore the data to a current state that does not include the problem transactions.

A REDO Transaction Recovery requires an outage for the PIT recovery. When redoing transactions in an environment where availability is crucial, the database can be brought down during the PIT recovery and, when that is done, the database can be brought back online. The subsequent redoing of the valid transactions to complete the recovery can be done with the data online, thereby reducing application downtime.

In contrast with the granularity provided by traditional recovery, Transaction Recovery allows a user to recover a specific portion of the data based on user-defined criteria. So only a portion of the data is affected. And any associated indexes are automatically recovered as the transaction is recovered. Additionally, with Transaction Recovery the transaction may impact data in multiple database objects, whereas a traditional recovery is performed object by object through the database.

A transaction is a set of related operations that, when grouped together, define a logical unit of work within an application. Transactions are defined by the user's view of the process. This might be the set of panels that comprise a new hire operation, or perhaps the set of jobs that post to the General Ledger. Examples of user-level transaction definitions might be:

All Updates issued by userid DSGRNTLD since last Wednesday at 11:50 AM.
All Deletes made by the application program PAYROLL since 8:00 PM yesterday.

Why is Transaction Recovery a much-needed tool in the arsenal of eDBAs? Well, applications are prone to all types of problems, bugs and errors. Using Transaction Recovery, the DBA can quickly react to application-level problems and maintain a higher degree of data availability. The database does not always need to be taken off-line while Transaction Recovery occurs (it depends on the type of Transaction Recovery being performed and the severity of the problem).


Choosing the Optimum Recovery Strategy

So, what is the best recovery strategy? Of course, the answer is - it depends. While Transaction Recovery may seem like the answer to all your database recovery problems, there are times when it is not possible or not advisable. To determine the type of recovery to choose, you need to consider several questions:

Transaction Identification. Can all the problem transactions be identified? You must be able to actually identify the transactions that will be removed from the database. Can all the work that was originally done be located and redone?

Data Integrity. Has anyone else updated the rows since the problem occurred? If they have, can you still proceed? Is all the data required still available? Recovering after a REORG, LOAD or mass DELETE may require the use of image copy backups. Will any other data be lost? If so, can the lost data be identified in some fashion?

Availability. How fast can the application become available again? Can you afford to go off-line? What is the business impact of the outage?

These questions actually boil down to a matter of cost. What is the cost of rework, and is it actually possible to determine what would need to be redone (what jobs to run, what documents to reenter)? This cost needs to be balanced against the cost of long scans of log data sets to isolate data to redo or undo, and the cost of applying that data using SQL.

The ultimate database recovery solution should analyze your overall environment and the transactions needing to be recovered, and recommend which type of recovery to perform. Furthermore, it should automatically generate the appropriate scripts and jobs to perform the recovery to avoid the errors that are sure to be introduced with manually developed scripts and jobs.

Database Design

In some cases you can minimize the impact of future database problems by properly designing the database for the e-business application that will use it. For example, you might be able to segment or partition the database by type of customer, location, or some other business criterion whereby only a portion of the database can be taken off-line while the rest remains operational. In this way, only certain clients will be affected, not the entire universe of users. Of course, this approach is not always workable, but sometimes "up front" planning and due diligence during database design can mitigate the impact of future problems.
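As a sketch of the idea, the following Oracle-style DDL list-partitions a hypothetical CUSTOMER_ORDERS table by region, so that a problem confined to one region can be repaired in isolation; the column names, regions, and partition names are illustrative assumptions.

    CREATE TABLE customer_orders (
      order_id    NUMBER        NOT NULL,
      region      VARCHAR2(4)   NOT NULL,
      order_date  DATE          NOT NULL,
      amount      NUMBER(12,2)
    )
    PARTITION BY LIST (region) (
      PARTITION p_amer VALUES ('US', 'CA', 'MX'),
      PARTITION p_emea VALUES ('UK', 'DE', 'FR'),
      PARTITION p_apac VALUES ('AU', 'JP', 'SG')
    );

    -- A data problem confined to one region can then be repaired by
    -- recovering or offlining only that partition while the other
    -- partitions, and their users, remain available.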

Reducing the Risk

These are just a few of the recovery techniques available to eDBAs to reduce outages and the impact of downtime for e-businesses. For example, some disk storage devices provide the capability to very quickly "snap" files using hardware techniques, the result being very fast image copy backups. Some recovery solutions work well with these new, smart storage devices and can "snap" the files back very quickly as well. Other solutions exist that back out transactions from the log to perform a database recovery. For eDBAs, a backout recovery may be desired in instances where a problem is identified quickly. You may be able to decrease the time required to recover by backing out the effects of a bad transaction instead of going back to an image copy and rolling forward through the log.

The bottom line is that, as an eDBA, you need to keep up-to-date with the technology available to reduce outages, both hardware and software offerings, and you need to understand how these technologies can work with your database environment. Remember that recovery does not always have to involve an outage. Think creatively, plan accordingly and deploy diligently, and you can deliver the service required of e-database administration. With proper planning and wise implementation of technologies that minimize outages, you can maintain high availability for your Web-enabled databases and applications.


Chapter 22 - Intelligent Automation of DBA Tasks

Automating eDBA Tasks

It is hard to get good help these days. There are more job openings for qualified, skilled IT professionals than there are individuals to fill the jobs. And one of the most difficult IT positions to fill is the DBA. DBAs are especially hard to recruit because the skills required to be a good DBA span multiple disciplines. These skills are difficult to acquire, and to make matters more difficult, the required skill set of a DBA is constantly changing. To effectively manage enterprise databases, a DBA must understand both the business reasons for storing the data in the database and the technical details of how the data is structured and stored. The DBA must understand the business purpose for the data to ensure that it is used appropriately and is accessible when the business requires it to be available. Appropriate usage involves data security rules, user authorization, and ensuring data integrity. Availability involves database tuning, efficient application design, and performance monitoring and tuning. These are difficult and complicated topics. Indeed, entire books have been dedicated to each of these topics.


Duties of the DBA

The technical duties of the DBA are numerous. These duties span the realm of IT disciplines from logical modeling to physical implementation. DBAs must possess the abilities to create, interpret, and communicate a logical data model and to create an efficient physical database design from a logical data model and application specifications. There are many subtle nuances involved that make these tasks more difficult than they sound. And this is only the very beginning. DBAs also need to be able to collect, store, manage, and query data about the data (metadata) in the database and disseminate it to developers that need the information to create effective application systems. This may involve repository management and administration duties, too.

After a physical database has been created from the data model, the DBA must be able to manage that database once it has been implemented. One major aspect of this management involves performance management. A proactive database monitoring approach is essential to ensure efficient database access. The DBA must be able to utilize the monitoring environment, interpret its statistics, and make changes to data structures, SQL, application logic, and the DBMS subsystem to optimize performance. And systems are not static; they can change quite dramatically over time. So the DBA must be able to predict growth based on application and data usage patterns and implement the necessary database changes to accommodate the growth. And performance management is not just managing the DBMS and the system. The DBA must understand SQL, the standard relational database access language. Furthermore, the DBA must be able to review SQL and host language programs and to recommend changes for optimization. As databases are implemented with triggers, stored procedures, and user-defined functions, the DBA must be able to design, debug, implement, and maintain the code-based database objects as well.

Furthermore, data in the database must be protected from hardware, software, system, and human failures. The ability to implement an appropriate database backup and recovery strategy based on data volatility and application availability requirements is required of DBAs. Backup and recovery is only a portion of the data protection story, though. DBAs must be able to design a database so that only accurate and appropriate data is entered and maintained - this involves creating and managing database constraints in the form of check constraints, rules, triggers, unique constraints, and referential integrity. Additionally, DBAs are required to implement rigorous security schemes for production and test databases to ensure that only authorized users have access to data.

And there is more! The DBA must possess knowledge of the rules of relational database management and the implementation of many different DBMS products. Also important is the ability to accurately communicate them to others. This is not a trivial task since each DBMS is different from the others and many organizations have multiple DBMS products (e.g., DB2, Oracle, SQL Server). And, remember, the database does not exist in a vacuum. It must interact with other components of the IT infrastructure. As such, the DBA must be able to integrate database administration requirements and tasks with general systems management requirements and tasks such as network management, production control and scheduling, and problem resolution, to name just a few systems management disciplines.

The capabilities of the DBA must extend to the applications that use databases, too. This is particularly important for complex ERP systems that interface differently with the DBMS. The DBA must be able to understand the requirements of the application users and to administer their databases to avoid interruption of business. This includes understanding how any ERP packages impact the business and how the databases used by those packages differ from traditional relational databases.

A Lot of Effort

Implementing, managing, and maintaining complex database applications spread throughout the world is a difficult task. To support modern applications a vast IT infrastructure is required that encompasses all of the physical things needed to support your applications. This includes your databases, desktops, networks, and servers, as well as any networks and servers outside of your environment that you rely on for e-business. These things, operating together, create your IT infrastructure. These disparate elements are required to function together efficiently for your applications to deliver service to their users. But these things were not originally designed to work together. So not only is the environment increasingly complex, it is inter-related. But it is not necessarily designed to be inter-related. When you change one thing, it usually impacts others.

What is the impact of this situation on DBAs? Well, for starters, DBAs are working overtime just to support the current applications and relational features. But new RDBMS releases are being made available faster than ever before. Microsoft is feverishly working on a new version of SQL Server right on the heels of the recently released SQL Server 2000. And IBM has announced DB2 Version 8, even though Version 7 was just released last year and many users have not yet migrated to it. So, the job of database administration is getting increasingly more difficult as database technology rapidly advances, adding new functionality, more options, and more complex and complicated capabilities. But DBAs are overworked, under-appreciated, and lack the time to gain the essential skills required to support and administer the latest features of the RDBMS they support. What can be done?

Intelligent Automation

One of the ways to reduce these problems is through intelligent automation. As IT professionals we have helped to deliver systems that automate multiple jobs throughout our organizations. That is what computer applications do: they automate someone's job to make that job easier. But we have yet to intelligently automate our DBA jobs. By automating some of the tedious day-to-day tasks of database administration, we can free up some time to learn about new RDBMS features and to implement them appropriately. But simple automation is not sufficient. The software should be able to intelligently monitor, analyze, and optimize applications using past, present, and future analysis of collected data. Simply stated, the software should work the way a consultant works -- fulfilling the role of a trusted advisor.


This advisor software should collect data about the IT environment from the systems (e.g., OS, DBMS, OLTP), objects, and applications. It should require very little initial configuration, so that it is easy to use for novices and skilled users alike. It should detect conditions requiring maintenance actions, and then advise the user of the problem, and finally, and most beneficial to the user, optionally perform the necessary action to correct the problems it identifies. Most management tools available today leave this analysis and execution up to the user. But intelligent automation solutions should be smart enough to optimize and streamline your IT environment with minimal, perhaps no, user or DBA interaction. The end result - software that functions like a consultant - enables the precious human resources of your organization to spend time on research, strategy, planning, and implementing new and advanced features and technologies. Only through intelligent automation will we be able to deliver on the promise of technology.
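To give a flavor of the kind of condition such software might watch for, the query below uses Oracle's dictionary views to flag tables whose chained-row counts suggest a reorganization; the 10 per cent threshold is an arbitrary illustration, and the statistics must be current for the numbers to mean anything.

    -- Flag tables whose chained or migrated rows exceed an illustrative
    -- threshold; these are candidates for reorganization or further analysis
    SELECT owner,
           table_name,
           num_rows,
           chain_cnt,
           ROUND(100 * chain_cnt / num_rows, 1) AS pct_chained
      FROM dba_tables
     WHERE num_rows > 0
       AND chain_cnt / num_rows > 0.10
     ORDER BY pct_chained DESC;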

Synopsis

As IT tasks get more complex and IT professionals are harder to employ and retain, more and more IT duties should be automated using intelligent management software. This is especially true for very complex jobs, such as DBA. Using intelligent automation will help to reduce the amount of time, effort, and human error associated with managing databases and complex applications.


Chapter 23 - Online Resources of the eDBA

Where to Turn for Help

As DBAs augment their expertise and skills to better prepare to support Web-enabled databases and applications, they must adopt new techniques and skills. We have talked about some of those skills in previous eDBA columns. But eDBAs have additional resources at their disposal, too. By virtue of being Internet-connected, an eDBA has access to the vast knowledge and experience of his peers. To take advantage of these online resources, however, the eDBA must know that the resources exist, how to gain access to them and where to find them. This article will discuss several of the Internet resources available to eDBAs.

Usenet Newsgroups

When discussing the Internet, many folks limit themselves to the World Wide Web. However, there are many components that make up the Internet. One often-overlooked component is the Usenet Newsgroup. Usenet Newsgroups can be a very fertile source of expert information. Usenet, an abbreviation for User Network, is a large collection of discussion groups called newsgroups. Each newsgroup is a collection of articles pertaining to a single, pre-determined topic. Newsgroup names usually reflect their focus. For example, comp.databases.ibm-db2 contains discussions about the DB2 Family of products.


Using News Reader software, any Internet user can access a newsgroup and read the information contained therein. Refer to Figure 1 for an example using the Forte Free Agent news reader to view messages posted to comp.databases.ibm-db2. The Free Agent news reader can be downloaded and used free of charge from the Forte website at www.forteinc.com. Netscape Navigator also provides news reader functionality.

There are many newsgroups that focus discussion on database and database-related issues. The following list shows some of the most pertinent newsgroups of interest to the eDBA.

Database-Related Usenet Newsgroups of Interest to eDBAs:

comp.client-server -- Information on client/server technology
comp.compression.research -- Information on research in data compression techniques
comp.data.administration -- Discussion of data modeling and data administration issues
comp.databases -- Issues regarding databases and data management
comp.databases.ibm-db2 -- Information on IBM's DB2 family of products
comp.databases.informix -- Information on the Informix DBMS
comp.databases.ms-sqlserver -- Information on Microsoft's SQL Server DBMS
comp.databases.object -- Information on object-oriented database systems
comp.databases.olap -- Information on data warehouse online analytical processing
comp.databases.oracle.marketplace -- Information on the Oracle market
comp.databases.oracle.server -- Information on the Oracle RDBMS
comp.databases.oracle.tools -- Information regarding add-on tools for Oracle
comp.databases.oracle.misc -- Miscellaneous Oracle discussions
comp.databases.sybase -- Information on the Sybase Adaptive Server RDBMS
comp.databases.theory -- Discussions on database technology and theory
comp.edu -- Computer science education
comp.misc -- General computer-related discussions
comp.unix.admin -- UNIX administration discussions
comp.unix.questions -- Question and answer forum for UNIX novices
bit.listserv.cics-l -- Information pertaining to the CICS transaction server
bit.listserv.dasig -- Database administration special interest group
bit.listserv.db2-l -- Information pertaining to DB2 (mostly mainframe)
bit.listserv.ibm-main -- IBM mainframe newsgroup


Of course, thousands of other newsgroups exist. You can use your news reader software to investigate the newsgroups available to you and to gauge the quality of the discussions conducted therein.

Mailing Lists

Another useful Internet resource for eDBAs is the mailing list. Mailing lists are a sort of community bulletin board. You can think of mailing lists as somewhat equivalent to a mass mailing, but mailing lists are not spam because users must specifically request to participate before they will receive any mail. This is known as "opting in." There are more than 40,000 mailing lists available on the Internet, and they operate using a list server. A list server is a program that automates the mailing list subscription requests and messages. The two most common list servers are Listserv and Majordomo. Listserv is also a common synonym for mailing list, but it is actually the name of a particular list server program.

Simply by subscribing to a mailing list, information will be sent directly to your e-mail in-box from the remote computer called the list server. The information that you will receive varies -- from news releases, to announcements, to questions, to answers. This information is very similar to the information contained in a newsgroup forum, except that it comes directly to you via e-mail. Users can also respond to mailing list messages very easily, enabling communication with every subscribed user. Responses are sent back to the list server as e-mail, and the list server sends the response out to all other members of the mailing list. To subscribe to a mailing list, simply send e-mail to the appropriate subscription address requesting a subscription.

There are several useful websites that catalog and document the available Internet mailing lists. Some useful sites include CataList and listTool. Of course, none of these sites track every single mailing list available to you. Vendors, consultants, Web portals and user groups also support mailing lists of various types. The only way to be sure you know about all the useful mailing lists out there is to become an actively engaged member of the online community.

The following list provides details on a few popular database-related mailing lists for eDBAs:

ORACLE-L ([email protected]) -- Discussion about the Oracle DBMS. To subscribe, e-mail [email protected] with the command: SUBSCRIBE ORACLE-L
DB2-L ([email protected]) -- Discussion about the DB2 Family of products. To subscribe, e-mail [email protected] with the command: SUBSCRIBE DB2-L
SYBASE-L ([email protected]) -- Discussion of Sybase products, platforms, and usage. To subscribe, e-mail [email protected] with the command: SUBSCRIBE SYBASE-L
VBDATA-L ([email protected]) -- Discussion of Microsoft Visual Basic data access. To subscribe, e-mail [email protected] with the command: SUBSCRIBE VBDATA-L

Websites and Portals

Of course, the Web is also a very rich and fertile source of database and DBA related information. But tracking things down on the Web can sometimes be difficult -- especially if you do not know where to look. Several good sources of DBMS information on the Web can be found by reviewing the websites of DBMS vendors, DBA tool vendors, magazine sites and consultant sites. For example, check out the following:

IBM DB2 (http://www-4.ibm.com/software/data/db2/)
Oracle (http://www.oracle.com/)
Microsoft SQL Server (http://www.microsoft.com/sql/default.asp)
BMC Software (http://www.bmc.com/)
Oracle Magazine (http://www.bmc.com/)
DB2 Magazine (http://www.db2mag.com/)
Database Trends (http://www.databasetrends.com/)
Data Management Review (http://www.dmreview.com/)
Yevich, Lawson & Associates (http://207.0.61.219/ylassoc/)
TUSC (http://www.tusc.com/)
DBA Direct (http://www.dbadirect.com/)
My website (http://www.craigmullins.com/)

These types of sites are very useful for obtaining up-to-date information about DBMS releases and versions, management tool offerings, and the like, but sometimes the information on these types of sites is very biased. For information that is more likely to be unbiased you should investigate the many useful Web portals and Web magazines that focus on DBMS technology. Of course, this website, www.dbazine.com, is a constant source of useful information about database administration and data warehouse management issues and solutions. There are several other quite useful database-related sites that are worth investigating, including:

Searchdatabase.com (http://www.searchdatabase.com/)
The Data Administration Newsletter (http://www.tdan.com/)
The Journal of Conceptual Modeling (http://www.inconcept.com/JCM/about.html)

No eDBA Is an Island

The bottom line is that eDBAs are not alone in the Internet-connected world. It is true that the eDBA is expected to perform more complex administrative tasks in less time and with minimal outages. But fortunately the eDBA has a wealth of help and support that is just a mouse click away. As an eDBA you are doing yourself a disservice if you do not take advantage of the Internet resources at your disposal.