
COMMUNICATIONS OF THE ACM January 1999/Vol. 42, No. 1 73

Enhancing Data Quality in Data Warehouse Environments

Donald P. Ballou and Giri Kumar Tayi

Nothing is more likely to undermine the performance and business value of a data warehouse than inappropriate, misunderstood, or ignored data quality.

Decisions by senior management lay the groundwork for lower corporate levels to develop policies and procedures for various corporate activities. However, the potential business contribution of these activities depends on the quality of the decisions and in turn on the quality of the data used to make them. Some inputs are judgmental, others are from transactional systems, and still others are from external sources, but all must have a level of quality appropriate for the decisions they will be part of.

Although concern about the quality of one’s data is not new, what is fairly recent is using the same data for multiple purposes, which can be quite different from their original purposes. Users working with a particular data set come to know and internalize its deficiencies and idiosyncrasies. This knowledge is lost when data is made available to other parties, as when data needed for decision making is collected in repositories called data warehouses.

Here, we offer a conceptual framework for enhancing data quality in data warehouse environments. We explore the factors that should be considered, such as the current level of data quality, the levels of quality needed by the relevant decision processes, and the potential benefits of projects designed to enhance data quality. Those responsible for data quality have to understand the importance of such factors, as well as the interaction among them. This understanding is mandatory in data warehousing environments characterized by multiple users with differing needs for data quality.

For warehouses supporting a limited number of decision processes, awareness of these issues coupled with good judgment should suffice. For more complex situations, however, the number and diversity of trade-offs make reliance on judgment alone problematic. For such situations, we offer a methodology that


systematically incorporates various factors, including trade-offs, and guides the data warehousing manager in allocating resources by identifying the data quality enhancement projects that maximize value to the users of the data.

Data warehousing efforts may not succeed for various reasons, but nothing is more certain to yield failure than lack of concern for the quality of the data. “Ignore or trivialize problems with the existing data at the start of the project, and that oversight will brutally assert itself . . .” [4]. Another reason warehousing efforts may not succeed is failure to store the appropriate data. Although this fact seems obvious, availability is still sometimes the sole criterion for storage. Data from outside sources and soft data may be ignored completely, even though such data is critical for many decision purposes. Although it might seem that data availability is not a data quality issue, users who regularly work with data include data availability as an attribute of data quality [12].

Data supporting organizational activities in a meaningful way should be warehoused. However, a particular data set may support several low-level organizational activities, whereas another supports only one activity but with higher priority. If a choice has to be made as to which data should be warehoused, how should a warehouse manager decide? Moreover, it may be relatively inexpensive to clean up a data set that is seldom used, but expensive to improve the quality of a frequently used data set. Again, if a choice has to be made, which data set should be worked on? There are other kinds of trade-offs. For example, it may be possible to improve the timeliness of certain data at the expense of completeness. Or it may be inexpensive to include some relatively unimportant data but costly to obtain unavailable but important data. All these trade-offs are in the context of limited resources available for improving the data’s quality.

A distinguishing characteristic of warehoused data is that it is used for decision making, rather than for operations. A particular data set may well support several decision processes. Support for several processes complicates data management, because these uses are likely to require different degrees of data quality. Research on enhancing data quality has for the most part been in the context of data supporting individual activities, usually of an operational nature [3].

Data warehousing efforts have to address several potential problems. For example, data from different sources may exhibit serious semantic differences. A classic case is the varying definitions of “sales” employed by different stakeholders in a corporation. Furthermore, data from various sources is likely to contain syntactic inconsistencies, which also have to be addressed. For example, there may well be discrepancies in the time periods for activity reports (such as bimonthly vs. every two weeks). Moreover, the desired data may simply not have been gathered.

A significant characteristic of data warehouses is the prominent roles of soft data and historical data. Operational systems exclude soft data, yet such data is often critical for decision making. By “soft data,” we mean data whose quality is inherently uncertain; an example is human resources evaluations related to future task assignments involving subjective rankings. For a data warehouse to support decision making, soft data, though imperfect, has to be made available. Whereas operational systems focus on current data, decision making often involves temporal comparisons. Thus, historical data often has to be included in data warehouses.

It is not sufficient to state that the data is wrong or not useful; such evaluations offer no guidance as to how warehouse managers should go about improving the data. To report that the data is inconsistent indicates a problem rather different from saying that the data is out of date. Part of the enhancement effort is to elicit from data users as precisely as possible what it is about the data they consider unsatisfactory. Determining what’s wrong with the data is facilitated by being aware of the dimensions of data quality.


It has long been recognized that data is best described or characterized via multiple attributes, or dimensions. For example, in 1985, Ballou and Pazer identified four dimensions of data quality: accuracy, completeness, consistency, and timeliness [1]. More recently, Wang and Strong [12] analyzed the various attributes of data quality from the perspective of the people using the data. They identified a full set of data quality dimensions, adding believability, value added, interpretability, accessibility, and others to the earlier four. These dimensions were then grouped into four broad categories: intrinsic, contextual, representational, and accessibility. For example, accuracy belongs to intrinsic; completeness and timeliness to contextual; consistency to representational; and availability to accessibility.
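This grouping of dimensions into broad categories can be captured as a small lookup table. The sketch below covers only the dimensions named here, following the categorization in [12]; the full taxonomy there is longer:

```python
# Wang-Strong grouping of data quality dimensions into four broad
# categories [12]; only the dimensions mentioned in the text are listed.
DIMENSION_CATEGORY = {
    "accuracy": "intrinsic",
    "believability": "intrinsic",
    "completeness": "contextual",
    "timeliness": "contextual",
    "value added": "contextual",
    "consistency": "representational",
    "interpretability": "representational",
    "accessibility": "accessibility",
    "availability": "accessibility",
}

def dimensions_in(category: str) -> list:
    """Return the dimensions that fall under a given broad category."""
    return sorted(d for d, c in DIMENSION_CATEGORY.items() if c == category)
```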

The earliest work in data quality was carried out by the accounting profession (such as [5]). Several studies have looked at the social and economic effects of inadequate data quality [6, 9]. The consequences of poor data quality have been explored using modeling approaches [1]. Various studies have sought to determine how to improve the quality of data [7, 8]. And other work has examined the nature of data quality [10–12].

Think Systematically

For enhancement efforts to be worthwhile, users and data warehouse managers alike have to think systematically about what is required. And organizations have to distinguish between what is vital and what is merely desirable. Data warehouses are developed to support some subset of the organization’s activities. For example, in the marketing realm, success depends on such organizational activities as market segmentation, brand and sales management, and marketing strategy. These activities are fundamental to business profitability and success. To make decisions, managers need access to both historical and soft data. Historical data can include sales by product, region, and time period; soft data can include customer preference surveys, fashion trends, and sales forecasts.

For data quality enhancement efforts to succeed, the data warehouse manager should first ascertain from decision makers what organizational activities the data warehouse has to support. Some mechanism has to be used to determine the priority of these activities. All other factors being equal, data quality enhancement efforts should be directed first to the activities with the highest priority.

Identifying data sets required to support the targeted organizational activities is also needed. By “data set,” we mean some clearly distinguishable collection of data, such as a traditional file or some aggregation of external data. Some data sets may not currently exist but can be created at reasonable cost and effort. It is important to note that each data set potentially supports several organizational activities simultaneously.

Finally, existing and potential problems with the data set have to be identified. The dimensions of data quality guide this determination. For example, if the required data set does not exist, then its data quality is deficient on the availability dimension. If the data exists but for whatever reason cannot be obtained, then accessibility is a problem. If there are semantic discrepancies within the data set, then interpretability is problematical. A data set can be deficient on more than one dimension. For example, a data set may not be accessible and may also be incomplete.

It is assumed that various projects can be undertaken to improve the quality of data. One might be to resolve syntactic differences among customer data records. Another might be to obtain regional sales data on a more timely basis. Another could be to identify and enforce a common definition of “sales” (at least as far as the warehouse is concerned). Another could involve obtaining external data regarding competitors’ activities—a form of soft data not currently available. And yet another might be to extract a relevant subset of some transaction file. Each one influences the quality of one or more data sets that in turn influence one or more of the organization’s activities. To make these ideas more precise, we introduce the following index notation:

I. Index for organizational activities supported by a data warehouse
J. Index for data sets
K. Index for data quality attributes or dimensions
L. Index for possible data quality projects

Consider the following scenario. The data warehouse supports three organizational activities: production planning (I=1), sales tracking (I=2), and promotion effectiveness (I=3). These activities are supported by three data sets: an inventory file (J=1), an historical sales file (J=2), and a promotional activity file (J=3). In terms of the quality dimensions, the inventory file is inaccurate (K=1, accuracy) and somewhat out of date (K=2, timeliness). However, the sales file is accurate (K=1) but has a monthly lag (K=2). The promotional activity file is incomplete (K=3, completeness), containing only aggregate data. To enhance the quality of these files, we undertake several projects to eliminate about 50% of the errors in the inventory file (L=1); make the sales file more timely, even if it sacrifices some accuracy (L=2); and tie the promotional file more closely to the actual sales (L=3). In practice, limited resources preclude undertaking all proposed projects.

A number of factors influence data quality enhancement projects. In practice, however, determining them precisely is quite difficult.

Current quality CQ(J,K). The current quality of each data set J is evaluated on each relevant data quality dimension K. For this scenario, the current accuracy CQ(J=1,K=1) and timeliness CQ(J=1,K=2) of the inventory file, among others, have to be evaluated.

Required quality RQ(I,J,K). This factor represents the level of data quality required by activity I, which uses data set J and depends on dimension K. Data warehousing implicitly assumes that the stored data is used for more than one purpose, and the quality requirements for these purposes can vary dramatically. For one use (such as aggregate production planning) of a particular data set (sales forecasts), order-of-magnitude correctness may suffice; for another (such as item-level planning), each value must be precise. The required quality also depends on the data quality dimensions.

In this scenario, the sales data set (J=2) supports the organizational activities production planning (I=1) and sales tracking (I=2). The current timeliness (K=2) of the sales file is more than adequate for sales tracking but inadequate for production planning.

For data sets used by more than one application, the current quality of a particular dimension may be insufficient for some of the applications and more than sufficient for others.

Anticipated quality AQ(J,K;L). This factor represents the quality of data set J on dimension K resulting from undertaking project L. The people responsible for data quality can undertake projects affecting the quality of various data sets, but a given project may well affect the various dimensions of data quality in different ways. It is quite possible that efforts to improve the quality of a particular dimension diminish the quality on another dimension; an example is the trade-off between accuracy and timeliness [2]. So, if project L=2 (to make sales file J=2 more timely) is undertaken, then the timeliness dimension (K=2) would improve, but accuracy (K=1) would suffer.

Although data quality projects usually seek to improve data quality, such a goal is not always on the agenda. For example, suppose that for a certain class of data much more data is gathered and stored than is needed. It might well make sense to actually lower the level of completeness for the data set. The reason would be to save resources.

The metric used to measure each of the three factors—CQ, RQ, and AQ—has to be the same. One could use the domain [0,1], with 0 representing the worst case and 1 the best [2]. For example, in regard to the inventory file, suppose CQ(1,1) = 0.6 (60% of the items are accurate). Then for project L=1, which seeks to eliminate 50% of the errors, AQ(1,1;1) = 0.8. Other possible metrics could be based on ordinal scales.
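The arithmetic behind this example is that removing a fraction of the errors moves the quality score that fraction of the way toward 1. A minimal sketch (the function name is ours):

```python
def anticipated_quality(current: float, error_reduction: float) -> float:
    """Quality on [0, 1] after a project removes a fraction of the errors.

    current: current quality CQ, the fraction of accurate items.
    error_reduction: fraction of the existing errors the project eliminates.
    """
    return current + error_reduction * (1.0 - current)

# Inventory file: CQ(1,1) = 0.6, and project L=1 removes 50% of the errors,
# so AQ(1,1;1) = 0.6 + 0.5 * 0.4 = 0.8.
aq = anticipated_quality(0.6, 0.5)
```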

Priority of organizational activity, Weight(I). Some organizational activities are clearly more important than others. Weight(I) represents the priority of activity I. All factors being equal, common sense dictates that data sets supporting high-priority activities receive preference over those involved in low-priority activities. Each Weight(I) should satisfy 0 < Weight(I) < 1, and the weights should add up to 1.

Cost of data quality enhancement, Cost(L). The cost of undertaking project L is represented by Cost(L). The various data quality enhancement projects involve commitment of such resources as funds, personnel, and time, all of which are inherently limited. There are also various crosscurrents between the Cost(L) and the Weight(I), including the question: Is it better to undertake a project that is expensive and supports an activity whose weight is large or to undertake a project that might support several activities, each of moderate importance? The only restriction on the measurement of the Cost(L) values is that the units are the same in all cases. Cost(L) values should be viewed as the total cost associated with project L over some period of time, such as a fiscal year or budget cycle. The Cost(L) values include not only the cost of actual project work but the costs associated with ongoing data-quality-related activities necessitated by the project. This interpretation of Cost(L) means it is possible for Cost(L) to be negative. For example, should the current quality of a particular data set be better than required, reducing the quality saves resources, resulting in a negative Cost(L); an example is storing summary rather than unneeded detail data.

Value added, Utility(I,J,K;L). This factor represents the change in value or utility for organizational activity I should project L be undertaken. Since project L could affect different dimensions of different data sets required by activity I, the utility depends explicitly on each relevant data set J and dimension K. Several considerations regarding the utility values have to be addressed. For example, if a particular project L is undertaken, some data sets may be totally unaffected; for such a data set, Utility(I,J,K;L) = 0. The utility can be either positive (the project enhances the data quality relevant to the activity) or negative (the project diminishes it). There is no value in improving quality more than required. And finally, a


particular project need not, and usually will not, completely remove all data quality deficiencies.

A logical consequence of these observations is that for each organizational activity I, there is a conceptual relationship between the magnitude of change in data quality and the value for Utility(I,J,K;L). In practice, determining all the functional relationships between the change in data quality and Utility(I,J,K;L) would be daunting. The simplest way to make that determination would be to use whatever knowledge the organization has to estimate all the non-zero Utility(I,J,K;L) values. In our scenario, there are 3 × 3 × 3 × 3 = 81 potential Utility(I,J,K;L) values, as I, J, K, and L each take three values. However, only eight such values have to be estimated, as the others are zero (see Table 1).
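This sparsity maps naturally onto a dictionary keyed by (I, J, K, L), with unlisted combinations defaulting to zero. A sketch (the helper name is ours; the keys follow Table 1):

```python
# The eight (I, J, K, L) combinations with non-zero utility in the scenario,
# as enumerated in Table 1; every other combination is zero.
NONZERO_KEYS = {
    (1, 1, 1, 1), (1, 2, 1, 2), (1, 2, 2, 2), (2, 2, 1, 2),
    (2, 2, 2, 2), (3, 3, 3, 3), (3, 2, 1, 3), (3, 2, 2, 2),
}

def utility(i: int, j: int, k: int, l: int, estimates: dict) -> float:
    """Look up Utility(I,J,K;L) from a sparse dict of estimated values."""
    return estimates.get((i, j, k, l), 0.0)

n_potential = 3 ** 4               # I, J, K, L each take three values: 81 combinations
n_to_estimate = len(NONZERO_KEYS)  # only 8 values ever need estimating
```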

Obtaining the Utility(I,J,K;L) estimates may be difficult and imprecise but nevertheless has to be done. After all, warehouse managers can’t make an intelligent decision about which projects should be undertaken unless they are aware of what data sets and data quality dimensions are affected by a particular project, as well as the relative magnitude of the benefit of the effect. Even more important, this process forces warehouse managers to think systematically about data quality enhancement.

Which Projects First?

For relatively simple situations, these factors and issues can guide data quality enhancement of warehoused data. However, as the number of organizational activities supported, data sets involved, and potential enhancement projects increases, it becomes difficult to cope with the various trade-offs and

impossible to determine an optimal project mix. For such situations, an integer programming model can help identify the subset of projects that best enhance the quality of the warehoused data. See Table 2 for a mathematical formulation based on this model’s description.

Each project is associated with a weight Weight(I) and a set of utilities Utility(I,J,K;L). The weighted sum of these utilities produces an overall worth for project L, call it Value(L). Note that Value(L) captures the fact that a particular data quality project can influence several dimensions of the same data set as well as multiple data sets involved in various organizational activities.

Optimization models include an objective function that, in this case, is the sum of the Value(L) of all selected projects.

There are several constraints, in addition to the objective function. For example, the sum of the cost of all projects selected cannot exceed the budget (Resource Constraint). Suppose that projects P(1), P(2), . . . , P(S) are mutually exclusive, so only one can be carried out. A constraint is needed to select at most one such project (Exclusiveness Constraint). Also suppose that, for example, projects P(1) and P(2) overlap in the sense that they share various aspects. Then develop a new project P(3) representing both P(1) and P(2), and select at most one of the three (Interaction Constraint). Such constraints are not necessarily exhaustive, and as in all modeling, the model can be refined. For example, if appropriate, Cost(L) can be organized into fixed and variable costs.

Table 1. Required U(I,J,K;L) values for the illustrative scenario
Format: U(I: Organizational Activity, J: Data Set, K: DQ Dimension; L: DQ Project)

U(1,1,1;1) = U(Production Planning, Inventory File, Accuracy; Inventory File Project)
U(1,2,1;2) = U(Production Planning, Sales File, Accuracy; Sales File Project)
U(1,2,2;2) = U(Production Planning, Sales File, Timeliness; Sales File Project)
U(2,2,1;2) = U(Sales Tracking, Sales File, Accuracy; Sales File Project)
U(2,2,2;2) = U(Sales Tracking, Sales File, Timeliness; Sales File Project)
U(3,3,3;3) = U(Promotional Effectiveness, Promotional File, Completeness; Promotional File Project)
U(3,2,1;3) = U(Promotional Effectiveness, Sales File, Accuracy; Sales File Project)
U(3,2,2;2) = U(Promotional Effectiveness, Sales File, Timeliness; Sales File Project)

Table 2. Integer programming model formulation

Value of project L: Value(L) = Σ_I Weight(I) Σ_J Σ_K Utility(I,J,K;L)

Maximize (total value from all projects): Σ_L X(L) · Value(L)

Resource Constraint: Σ_L X(L) · Cost(L) ≤ Budget

Exclusiveness Constraint: X(P(1)) + X(P(2)) + … + X(P(S)) ≤ 1

Interaction Constraint: X(P(1)) + X(P(2)) + X(P(3)) ≤ 1

Integer Constraints: X(L) = 1 if project L is selected; 0 otherwise

Implementation of this integer-programming model identifies data quality projects that would increase to the greatest possible extent the utility of the warehoused data, systematically incorporating various trade-offs, including the value of obtaining data sets not currently available and evaluating gains from reducing the quantity of stored data in certain cases.
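For an instance as small as the scenario, the project mix the model would select can be found by exhaustive enumeration rather than an integer programming solver. A sketch, with cost and value figures that are ours for illustration only:

```python
from itertools import combinations

def best_project_mix(value, cost, budget, exclusive_groups=()):
    """Brute-force the 0-1 selection: maximize total Value(L) subject to
    the budget cap and at-most-one-per-group exclusiveness constraints.

    value, cost: dicts keyed by project id (cost may be negative).
    exclusive_groups: iterable of sets of mutually exclusive project ids.
    """
    projects = list(value)
    best, best_value = frozenset(), 0.0
    for r in range(len(projects) + 1):
        for subset in combinations(projects, r):
            chosen = set(subset)
            if sum(cost[p] for p in chosen) > budget:
                continue  # violates the Resource Constraint
            if any(len(chosen & set(g)) > 1 for g in exclusive_groups):
                continue  # violates an Exclusiveness Constraint
            total = sum(value[p] for p in chosen)
            if total > best_value:
                best, best_value = frozenset(chosen), total
    return best, best_value

# Illustrative figures (not from the article):
value = {1: 5.0, 2: 8.0, 3: 4.0}
cost = {1: 10.0, 2: 12.0, 3: 6.0}
chosen, total = best_project_mix(value, cost, budget=18.0)
```

Enumerating all 2^n subsets is only feasible for a handful of candidate projects; realistic instances would hand Table 2’s formulation to an integer programming solver.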

However, the difficulty in implementing any model is in obtaining values for model parameters. Therefore, for our model, the data warehouse manager has to:

• Determine the organizational activities the data warehouse will support;
• Identify all sets of data needed to support the organizational activities;
• Estimate the quality of each data set on each relevant data quality dimension;
• Identify a set of potential projects (and their cost) that could be undertaken for enhancing or affecting data quality;
• Estimate for each project the likely effect of that project on the quality of the various data sets, by data quality dimension; and
• Determine for each project, data set, and relevant data quality dimension the change in utility should a particular project be undertaken.

The first two items apply even if data quality is not addressed. The third and fourth are required if a data warehouse manager wants to improve data quality. A problem arises with the third item in that the task of measuring data quality is not yet fully solved [11]. The difficulty in implementing the model resides in the fifth and sixth items.

However, a model can always be simplified at the expense of its ability to capture reality. One of the model’s complexities arises from the need to evaluate the quality of the data sets on the relevant dimensions. This evaluation is necessary if trade-offs regarding the dimensions of data quality represent a significant issue. If the trade-offs are not important, then an overall quality measure for each data set is adequate. Thus, the current quality CQ would depend on the data set only; the required quality RQ on the activity and data set only; the anticipated quality AQ on the data set and project only; and the change in utility on the activity, data set, and project only. Ignoring the trade-offs would substantially reduce the effort of estimating parameters.

The model can be formulated using a continuous measurement scale, implying CQ is some value between 0 and 1, and U(I,J,K;L) is a continuous function of the change in data quality. Implementation of the model can be simplified considerably by converting to a discrete version. For example, CQ could be qualitatively evaluated as low, medium, or high, and the change in utility of a data set on a certain dimension resulting from a specific project could also be so measured. Finally, these qualitative estimates—low, medium, high—would be converted into corresponding numerical values, such as 1, 2, 3.
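One way to realize this discrete version, with cut points that are our assumption rather than the article’s:

```python
LEVELS = {"low": 1, "medium": 2, "high": 3}

def qualitative(cq: float) -> str:
    """Map a continuous [0, 1] quality score to low/medium/high.

    The cut points at 1/3 and 2/3 are an illustrative assumption; the
    article does not prescribe where to draw them.
    """
    if cq < 1 / 3:
        return "low"
    if cq < 2 / 3:
        return "medium"
    return "high"

# The inventory file's CQ(1,1) = 0.6 would be rated "medium", numeric value 2.
rating = LEVELS[qualitative(0.6)]
```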

Conclusion

Data warehousing depends on integrating data quality assurance into all warehousing phases—planning, implementation, and maintenance. Since warehoused data is accessed by users with varying needs, data quality activities should be balanced to best support organizational activities. Therefore, the warehouse manager has to be aware of the various issues involved in enhancing data quality. That’s why we presented ideas and described a model to support quality enhancement while highlighting relevant issues and factors. Combined, they are a framework for thinking comprehensively and systematically about data quality enhancement in data warehouse environments.

References
1. Ballou, D., and Pazer, H. Modeling data and process quality in multi-input, multi-output information systems. Management Science 31, 2 (Feb. 1985), 150–162.
2. Ballou, D., and Pazer, H. Designing information systems to optimize the accuracy-timeliness trade-off. Information Systems Research 6, 1 (Mar. 1995), 51–72.
3. Ballou, D., and Tayi, G. Methodology for allocating resources for data quality enhancement. Commun. ACM 32, 3 (Mar. 1989), 320–329.
4. Celko, J., and McDonald, J. Don’t warehouse dirty data. Datamation 41, 19 (Oct. 1995), 42–53.
5. Cushing, B. A mathematical approach to the analysis and design of internal control systems. Accounting Review 49, 1 (Jan. 1974), 24–41.
6. Laudon, K. Data quality and due process in large interorganizational record systems. Commun. ACM 29, 1 (Jan. 1986), 4–18.
7. Morey, R.C. Estimating and improving the quality of information in an MIS. Commun. ACM 25, 5 (May 1982), 337–342.
8. Redman, T. Data Quality: Management and Technology. Bantam Books, New York, 1992.
9. Strong, D., and Miller, S. Exceptions and exception handling in computerized information processes. ACM Transactions on Information Systems 13, 2 (Apr. 1995), 206–233.
10. Wand, Y., and Wang, R. Anchoring data quality dimensions in ontological foundations. Commun. ACM 39, 11 (Nov. 1996), 86–95.
11. Wang, R., Storey, V., and Firth, C. A framework for analysis of data quality research. IEEE Transactions on Knowledge and Data Engineering 7, 4 (Aug. 1995), 623–640.
12. Wang, R.Y., and Strong, D.M. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems 12, 4 (Spring 1996), 5–34.

Donald P. Ballou ([email protected]) is an associate professor of management science and information systems in the School of Business at the State University of New York, Albany.
Giri Kumar Tayi ([email protected]) is an associate professor of management science and information systems in the School of Business at the State University of New York, Albany.

© 1999 ACM 0002-0782/99/0100 $5.00

