
    Explaining differences in multidimensional aggregates

    Sunita Sarawagi

    School of Information Technology, Indian Institute of Technology, Bombay
    Mumbai 400076, India
    [email protected]

    Abstract

    Our goal is to enhance multidimensional database systems with advanced mining primitives. Current Online Analytical Processing (OLAP) products provide a minimal set of basic aggregate operators like sum and average and a set of basic navigational operators like drill-downs and roll-ups. These operators have to be driven entirely by the analyst's intuition. Such ad hoc exploration gets tedious and error-prone as data dimensionality and size increase. In earlier work we presented one such advanced primitive where we pre-mined OLAP data for exceptions, summarized the exceptions at appropriate levels, and used them to lead the analyst to the interesting regions.

    In this paper we present a second enhancement: a single operator that lets the analyst get summarized reasons for drops or increases observed at an aggregated level. This eliminates the need to manually drill down for such reasons. We develop an information-theoretic formulation for expressing the reasons that is compact and easy to interpret. We design a dynamic programming algorithm that requires only one pass of the data, improving significantly over our initial greedy algorithm that required multiple passes. In addition, the algorithm uses a small amount of memory, independent of the data size. This allows easy integration with existing OLAP products. We illustrate with our prototype on the DB2/UDB ROLAP product with the Excel Pivot-table frontend. Experiments on this prototype using the OLAP data benchmark demonstrate (1) scalability of our algorithm as the size and dimensionality of the cube increase and (2) feasibility of getting interactive answers even with modest hardware resources.

    Part of the work was done when the author was at IBM Almaden Research Center, USA.

    Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

    Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.

    1 Introduction

    Online Analytical Processing (OLAP) [Cod93, CD97] products [Arb, Sof] were developed to help analysts do decision support on historic transactional data. Logically, they expose a multidimensional view of the data with categorical attributes like Products and Stores forming the dimensions and numeric attributes like Sales and Revenue forming the measures or cells of the multidimensional cube. Dimensions usually have associated with them hierarchies that specify aggregation levels. For instance, store name → city → state is a hierarchy on the Store dimension and UPC code → type → category is a hierarchy on the Product dimension. The measure attributes are aggregated to various levels of detail of the combination of dimension attributes using functions like sum, average, max, min, count and variance. For exploring the multidimensional data cube there are navigational operators like select, drill-down, roll-up and pivot conforming to the multidimensional view of data. In the past most of the effort has been spent on expediting these simple operations so that the user can interactively invoke sequences of them. A typical analysis sequence proceeds as follows: the user starts from an aggregated level, inspects the entries visually, maybe aided by some graphical tools, selects subsets to inspect further based on some intuitive hypothesis or needs, drills down to more detail, inspects the entries again, and either rolls up to some less detailed view or drills down further, and so on.
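    To make the data model concrete, the following small Python sketch (ours, not part of the paper) represents a fact table with two dimensions and one measure, and rolls it up along one hypothetical hierarchy level:

        from collections import defaultdict

        # A minimal sketch: a tiny fact table with (store, product, sales) rows,
        # rolled up along an invented store -> city level of the Store hierarchy.
        facts = [("s1", "p1", 100.0), ("s1", "p2", 40.0), ("s2", "p1", 70.0)]
        store_to_city = {"s1": "Mumbai", "s2": "Pune"}

        def roll_up(rows, level_map):
            """Sum the measure after mapping the first dimension to its parent level."""
            totals = defaultdict(float)
            for store, product, sales in rows:
                totals[(level_map[store], product)] += sales
            return dict(totals)

        print(roll_up(facts, store_to_city))
        # {('Mumbai', 'p1'): 100.0, ('Mumbai', 'p2'): 40.0, ('Pune', 'p1'): 70.0}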

    The above form of manual exploration can get tedious and error-prone for the large datasets that commonly appear in real life. For instance, a typical OLAP dataset has five to seven dimensions, an average of a three-level hierarchy on each dimension, and aggregates more than a million rows [Cou]. The analyst could potentially overlook significant discoveries due to the heavy reliance on intuition and visual inspection.

    This paper is part of our continued work on taking OLAP to the next stage of interactive analysis where we automate much of the manual effort spent in analysis. Recent work in this direction attempts to enhance OLAP products with known mining primitives: decision tree classifiers to find the factors affecting profitability of products are used by Information Discovery Inc. [Dis] and Cognos [Cor97a]; clustering customers based on buying patterns to create new hierarchies is used in Pilot Software [Sof]; and association rules at multiple levels of aggregation to find progressively detailed correlations between members of a dimension are suggested by Han et al. [HF95]. In all these cases, the approach has been to take existing mining algorithms and integrate them within OLAP products. Our approach is different in that we first investigate how and why analysts currently explore the data cube and then automate these explorations using new or previously known operators. Unlike the batch processing of existing mining algorithms, we wish to enable interactive invocation so that an analyst can use these operators seamlessly with the existing simple operations.

    In [SAM98] we presented one such operation, motivated by the observation that a significant reason why analysts explore to detailed levels is to search for abnormalities or exceptions in detailed data. We proposed methods for finding exceptions and used them to guide the analysts to the interesting regions.

    In this paper we seek to automate another area where analysts spend significant manual effort exploring the data: namely, exploring the reasons why a certain aggregated quantity is lower or higher in one cell compared to another. For example, a busy executive looking at the annual reports might quickly wish to find the main reasons why sales dropped from the third to the fourth quarter in a region. Instead of digging through heaps of data manually, he could invoke our new diff operator, which in a single step will do all the digging for him and return the main reasons in a compact form that he can easily assimilate.

    We explore techniques for defining how to answer this form of why query, algorithms for efficiently answering such queries in an ad hoc setting, and system issues in integrating this capability with existing OLAP products.

    Product              Platform             Geography       Year
    Product name (67)    Platform name (43)   Geography (4)   Year (5)
    Prod_Category (14)   Plat_Type (6)
    Prod_Group (3)       Plat_User (2)

    Figure 1: Dimensions and hierarchies of the software revenue data. The numbers in brackets indicate the size of that level of the dimension.

    Plat_User (All)   Prod_Category (All)   Product (All)   Plat_Type (All)   Prod_Group (All)   Platform (All)

    Sum of Revenue    Year
    Geography         1990      1991      1992       1993       1994
    Asia/Pacific      1440.24   1946.82   3453.56    5576.35    6309.88
    Rest of World     2170.02   2154.14   4577.42    5203.84    5510.09
    United States     6545.49   7524.29   10946.87   13545.42   15817.18
    Western Europe    4551.90   6061.23   10053.19   12577.50   13501.03

    Figure 2: Total revenue by geography and year. The two boxes with dark boundaries (Rest of World, years 1990 and 1991) indicate the two values being compared.

    1.1 Contents

    We first demonstrate the use of the new diff operator using a real-life dataset. Next, in Section 2, we discuss ways of formulating the answers. In Section 3 we present algorithms for efficiently finding the best answer based on our formulation. In Section 4 we describe our overall system architecture and present experimental results.

    1.2 Illustration

    Consider the dataset shown in Figure 1 with four dimensions Product, Platform, Geography and Time, and a three-level hierarchy on the Product and Platform dimensions. This is a real-life dataset obtained from International Data Corporation (IDC) [Cor97b] about the total yearly revenue in millions of dollars for different software products from 1990 to 1994. We will use this dataset as a running example in the paper.

    Suppose a user is exploring the cube at the Geography × Year plane as shown in Figure 2. He notices a steady increase in revenue in all cases except when going from 1990 to 1991 in Rest of World.

    With the existing tools the user has to find the reason for this drop by manually drilling down to the numerous different planes underneath it, inspecting the entries for big drops and drilling down further. This process can get rather tedious, especially for the typically larger real-life datasets. We propose to use the new operator for finding the answer to this question in one step. The user simply highlights the two cells and invokes our diff module. The result, as shown in Figure 3, is a list of at most 10 rows (the number 10 is configurable by the user) that concisely explain the reasons for the drop.

    PRODUCT                      PLAT_USER  PLAT_TYPE  PLATFORM                 YEAR_1990  YEAR_1991  RATIO   ERROR
    (All)-                       (All)-     (All)      (All)                    1620.02    1820.05    1.12    34.07
    Operating Systems            Multi      (All)-     (All)                    253.52     197.86     0.78    23.35
    Operating Systems            Multi      Other M.   Multiuser Mainframe IBM  97.76      1.54       0.02    0.00
    Operating Systems            Single     Wn16       (All)                    94.26      10.73      0.11    0.00
    Middleware & Oth. Utilities  Multi      Other M.   Multiuser Mainframe IBM  101.45     9.55       0.09    0.00
    EDA                          Multi      Unix M.    (All)                    0.36       76.44      211.74  0.00
    EDA                          Single     Unix S.    (All)                    0.06       13.49      210.78  0.00
    EDA                          Single     Wn16       (All)                    1.80       10.89      6.04    0.00

    Figure 3: Reasons for the drop in revenue marked in Figure 2.

    Geography (All)   Plat_User (All)   Prod_Group Soln

    Sum of Revenue    Year
    Prod_Category     1990      1991      1992      1993      1994
    Cross Ind. Apps   1974.57   2484.20   4563.57   7407.35   8149.86
    Home software                                   293.91    574.89
    Other Apps        843.31    1172.44   3436.45
    Vertical Apps     898.06    1460.83   2826.90   7947.05   8663.39

    Figure 4: Total revenue for different categories in product group Soln. We are comparing the revenues in years 1992 and 1993 for product category Vertical Apps.

    In Figure 3 the first row shows that, after discounting the rows explicitly mentioned below it (as indicated by the "-" symbol after (All)), the overall revenue actually increased by 10%. The next four rows (rows two through five) identify entries that were largely responsible for the large drop. For instance, for Operating Systems on Multiuser Mainframe IBM the revenue decreased from 97 to 1.5, a factor of 60 reduction, and for Operating Systems on all platforms of type Wn16 the total revenue dropped from 94.3 to 10.7. The last three rows show cases where the revenue increased significantly more than the 10% overall increase indicated by the first row. The role of the RATIO and ERROR columns in Figure 3 will be explained in Section 2.

    Similarly, a user might be interested in exploring reasons for sudden increases. In Figure 4 we show the total revenue for different categories in the product group Soln. The user wants to understand the reason for the sudden increase in revenue of Vertical Apps from 2826 in 1992 to 7947 in 1993, almost a factor of three increase. For the answer he can again invoke the diff operator instead of doing multiple drill-downs and selection steps on the huge amount of detailed data. The result is the set of rows shown in Figure 5. From the first row of the answer, we understand that overall the revenue increased by a modest 30% after discounting the rows underneath it. The next set of rows indicates that the main increase is due to the products Manufacturing-Process, Other Vertical Apps, Manufacturing-Discrete and Health Care in various geographies and platforms. The last two rows indicate the two cases where there was actually a drop that is significant compared to the 30% overall increase indicated by the first row.

    This form of answer helps the user quickly grasp the major findings in the detailed data by simply glancing at the few rows returned.

    2 Problem Formulation

    In this section we discuss different formulations for summarizing the answer for the observed difference.

    The user poses the question by pointing at two aggregated quantities v_a and v_b in the data cube and wishes to know why one is greater or smaller than the other. Both v_a and v_b are formed by aggregating one or more dimensions. Let C_a and C_b denote the two subcubes formed by expanding the common aggregated dimensions of v_a and v_b. For example, in Figure 2 v_a denotes the revenue for cell (All platforms, All products, Rest of World, 1990) and v_b denotes the total revenue for cell (All platforms, All products, Rest of World, 1991); C_a denotes the two-dimensional subcube with dimensions <Platform, Product> for Year=1990 and Geography=Rest of World, and C_b denotes the two-dimensional subcube with the same set of dimensions for Year=1991 and Geography=Rest of World. In this example, the two cells differ in only one dimension. We can equally well handle cases where the two cells differ in more than one dimension as long as we have some common dimensions on which both the cells are aggregated. The answer A consists of a set of rows that best explains the observed increase or decrease at the aggregated level. There is a limit N (configurable by the user) on the size of A. The user sets this limit based on the number of rows he or she is willing to inspect. We expect the value of N to be small, around a dozen or so.
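    As a rough illustration (ours, not the paper's), the inputs to the diff operator can be viewed as two parallel mappings from the common detailed coordinates to measure values, together with the answer-size limit N; the coordinates below are taken loosely from Figure 3:

        # Hypothetical in-memory form of a diff query: C_a and C_b map the common
        # detailed coordinates (product, platform) to revenue; N bounds the answer size.
        C_a = {("Operating Systems", "Wn16"): 94.26, ("EDA", "Unix S."): 0.06}   # e.g. year 1990
        C_b = {("Operating Systems", "Wn16"): 10.73, ("EDA", "Unix S."): 13.49,
               ("EDA", "Wn16"): 10.89}                                           # e.g. year 1991
        N = 10  # limit on the number of rows in the answer A

        def paired_rows(c_a, c_b):
            """Yield (key, v_a, v_b), using None for values missing in one subcube."""
            for key in sorted(set(c_a) | set(c_b)):
                yield key, c_a.get(key), c_b.get(key)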

    A simple approach is to show the large changes in the detailed data sorted by the magnitude of the change (the detail-N approach). The obvious problem with this simple solution is that the user could still be left with a large number of entries to eyeball. In Figure 6 we show the first 10 rows obtained by this method for the difference query in Figure 4. This detail-N approach explains only 900 out of the total difference of about 5000. In contrast, with the answer in Figure 5 we could explain more than 4500 of the total difference.

    PRODUCT                   GEOGRAPHY       PLAT_TYPE  PLATFORM                 YEAR_1992  YEAR_1993  RATIO  ERROR
    (All)-                    (All)-          (All)-     (All)                    2113.0     2763.5     1.3    200.0
    Manufacturing - Process   (All)           (All)      (All)                    25.9       702.5      27.1   250.0
    Other Vertical Apps       (All)-          (All)-     (All)                    20.3       1858.4     91.4   251.0
    Other Vertical Apps       United States   Unix S.    (All)                    8.1        77.5       9.6    0.0
    Other Vertical Apps       Western Europe  Unix S.    (All)                    7.3        96.3       13.2   0.0
    Manufacturing - Discrete  (All)           (All)      (All)                               1135.2            0.0
    Health Care               (All)-          (All)-     (All)-                   6.9        820.4      118.2  98.0
    Health Care               United States   Other M.   Multiuser Mainframe IBM  1.5        10.6       6.9    0.0
    Banking/Finance           United States   Other M.   (All)                    341.3      239.3      0.7    60.0
    Mechanical CAD            United States   (All)      (All)                    327.8      243.4      0.7    34.0

    Figure 5: Reasons for the increase in revenue marked in Figure 4.

    PRODUCT                   GEOGRAPHY       PLATFORM                        YEAR_1992  YEAR_1993  RATIO
    Other Vertical Apps       Western Europe  Multiuser Minicomputer OpenVMS             99.9
    Other Vertical Apps       Asia/Pacific    Single-user MAC OS                         92.5
    Other Vertical Apps       Rest of World   Multiuser Mainframe IBM                    88.1
    Other Vertical Apps       Western Europe  Single-user UNIX                7.3        96.3       13.2
    Other Vertical Apps       United States   Multiuser Minicomputer Other               97.2
    Other Vertical Apps       United States   Multiuser Minicomputer OS/400              99.5
    Other Vertical Apps       Asia/Pacific    Multiuser Minicomputer OS/400              99.6
    EDA                       Western Europe  Multiuser UNIX                  192.6      277.8      1.4
    Manufacturing - Discrete  United States   Multiuser Mainframe IBM                    88.4
    Health Care               United States   Multiuser Minicomputer Other               88.2

    Figure 6: The top ten rows of the detail-N approach.

    Product: Manufacturing - Process    Geography: (All)

    Sum of Revenue    Year
    Plat_Type         1992    1993
    Other M.          9.86    472.84
    Other S.          0.02    21.88
    Unix M.           7.45    105.09
    Unix S.           1.31    16.89
    Wn16              3.27    85.38
    Wn32                      0.38

    Figure 7: Summarizing rows with similar change.

    The main idea behind this more compact representation is summarizing rows with similar changes. In Figure 5 we have only one row from the detailed level; the rest are all from summarized levels. By aggregating over Platforms and Geographies with similar change, a more compact representation of the change is obtained. For instance, in Figure 7 we show the details along different Platforms for the second row in the answer in Figure 5 (Manufacturing - Process, All, All). Notice that all the different platform types show a similar (though not identical) increase when going from 1992 to 1993. Hence, instead of listing them separately in the answer, we list a common parent of all these rows. The parent's ratio, shown as the RATIO column in Figures 3 and 5, is assumed to hold for all its children. The error incurred in making this assumption is indicated by the last column in the figures. In these experiments the error was calculated by taking the square root of the sum of squares of the errors of each detailed row. Notice that in both cases, this quantity is a small percent of the absolute difference of the values compared.
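    Read literally, that description corresponds to the following per-row error aggregation (our notation; the paper does not spell the formula out), where r is the RATIO reported for an answer row p:

        ERROR(p) = sqrt( sum over detailed rows v summarized by p of (v_b - r * v_a)^2 )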

    The compaction can be done in several different ways. The challenge is in choosing a method that on the one hand gives higher weightage to changes of larger magnitude but on the other hand allows summarization of rows which have almost similar changes. If we report everything at the most detailed level, the error due to summarization is 0 but the coverage is also limited, whereas if we aggregate heavily we can perhaps explain more of the change but the error due to summarization will also be high. We developed an information-theoretic model for cleanly capturing these tradeoffs.

    2.1 The model

    Imagine a sender wishing to transmit the subcube C_b to a user (receiver) who already knows about subcube C_a. The sender could transmit the entire subcube C_b, but that would be highly redundant because we expect a strong correlation between the values of corresponding cells in C_a and C_b. A more efficient method is to send a compact summary of the difference. The N-row answer A is such a summary. A consists of rows from not only detailed but also aggregated levels of the cube. With each aggregated row in the answer, we associate a ratio r that indicates to the user that everything underneath that row had the same ratio unless explicitly mentioned otherwise. For instance, in Figure 5 all Geographies and Platforms for Health Care are assumed to have a ratio of 118 except for those in United States and Platform Multiuser Mainframe. We need to find A such that a user reconstructing C_b from C_a and A will incur the smallest amount of error. Intuitively, we can achieve this by listing rows that are significantly different from their parents and aggregating rows that are similar, such that the error due to summarization is minimized. In information theory this error is characterized as the number of bits needed to transmit C_b to a user who already has C_a and A. The number of bits is calculated using classical Shannon information theory [CT91], which states that data x can be encoded, using a model M that the sender and receiver have agreed on, in L(x|M) = -log Pr[x|M] bits, where Pr[x|M] denotes the probability of observing x given the model. In our case the data x is C_b, the model is (C_a, A), and the number of bits is calculated as follows:

        For each detailed row v in C_b:
            If v already appears in A:
                the probability is 1 and cost(v) = 0.
            Else:
                Find its most immediate parent p in A.
                Let r be the ratio associated with p.
                The expected value of row v in C_b is r * v_a.
                Assume a suitable probability distribution (as discussed in
                Section 2.1.1) centered around r * v_a.
                cost(v) = -log Pr[v_b | r * v_a].
            Increment the total number of bits by cost(v).
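    A minimal Python sketch of this cost computation, assuming a normal error model with a fixed variance (the helper names and the variance choice are ours, not the paper's):

        import math

        def bit_cost(v_b, expected, variance):
            """-log2 probability density of observing v_b under a normal model centered at expected."""
            log_pdf = -((v_b - expected) ** 2) / (2 * variance) - 0.5 * math.log(2 * math.pi * variance)
            return -log_pdf / math.log(2)  # convert nats to bits

        def total_cost(rows, answer, ratio_of_parent, variance=1.0):
            """rows: iterable of (key, v_a, v_b); answer: set of keys already listed in A.
            ratio_of_parent: maps a detailed key to the ratio r of its closest ancestor in A."""
            bits = 0.0
            for key, v_a, v_b in rows:
                if key in answer:
                    continue  # cost 0: the row is transmitted explicitly as part of A
                r = ratio_of_parent(key)
                bits += bit_cost(v_b, r * v_a, variance)
            return bits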

    The ratio r associated with any p in A is calculated as follows. When no child of p is in A, the ratio r is p_b/p_a. When a child of p is included in A, we remove the contribution of the child to further refine the estimate of the ratio for the children not included in A.

    The goal of the sender is then to pick an answer A that minimizes the total data transmission cost -log Pr[C_b | C_a, A].

    In the above cost formulation we have ignored the cost of transmitting the answer set of size N, since we assume that it is a user-supplied parameter. We can easily include the cost of transmitting A by assuming a suitable model for transmitting each of its rows. One such model is to sum up the number of bits needed to transmit each attribute of each row as follows: for each attribute find the cardinality of the corresponding dimension of C_a. Since C_a is known to the user, we only need to transmit an index of the required member. The number of bits needed for the index is log(n_i + 1), where n_i denotes the cardinality of the i-th dimension of cube C_a and the +1 adds the possibility of sending (All). Once the number of bits needed for encoding an answer set of size N is known, we can select the value of N that leads to the minimum total number of bits. Such an approach would be particularly useful when the user is at a loss about picking the value of the parameter N. In such a case, the algorithm could start with a suitably large value of N and finally chop the answer at a value of N for which the total cost is minimized.
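    The per-row transmission cost under that model can be sketched as follows (the cardinalities in the example come from Figure 1; restricting attention to the two unaggregated dimensions of the query is our simplification):

        import math

        def row_bits(cardinalities):
            """Bits to transmit one answer row: one index per dimension, with (All) as an extra member."""
            return sum(math.log2(n + 1) for n in cardinalities)

        # Example with the software revenue schema at the detailed level: Product (67) and
        # Platform (43), since Geography and Year are fixed by the query of Figure 2.
        print(row_bits([67, 43]))  # about 11.5 bits per answer row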

    2.1.1 Probability function

    We use the probability distribution function as a means to convert the difference between the actual value v_b and the expected value r * v_a into a number of bits. In real life it is hard to find a probability distribution function that the data follows perfectly. We found that it is not critical to choose a perfect model for the data. In choosing a function we only need to make sure that changes that are significant (contribute large amounts to the total) get more bits for the same ratio, and that rows with slightly different ratios can be easily summarized. For instance, if one row changes from 1 to 10 and another from 100 to 1000, the second change should be considered more interesting even though the ratio of change is the same. However, simply looking at magnitude is not enough, because large numbers with even slightly different ratios would have a large difference, and it should still be possible to summarize them to enable inclusion of other changes that could be slightly smaller in magnitude but have much larger ratios.

    For measures representing counts, such as total units sold, the Poisson distribution (with mean r * v_a) provides a good approximation because we can think of each transaction as contributing a count of 1 or 0 to the measure independently of other transactions, thus forming a binomial distribution which we can approximate by a Poisson when the number of transactions is large [Ham77]. For other measures, the normal distribution is often a safe choice and is used extensively in business analysis when more specific information about the data is lacking [Ham77].
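    For count measures the per-row bit cost could be computed along these lines (our sketch; the paper prescribes only the choice of distribution, not this code):

        import math

        def poisson_bits(v_b, mean):
            """-log2 Pr[v_b] under a Poisson model with the given mean (v_b treated as a count)."""
            k = int(round(v_b))
            log_p = k * math.log(mean) - mean - math.lgamma(k + 1)  # log of the Poisson pmf
            return -log_p / math.log(2)

        # A change from 1 to 10 versus a change from 100 to 1000, under a parent ratio of 1:
        print(poisson_bits(10, 1.0), poisson_bits(1000, 100.0))
        # the second row costs far more bits, so it is the more interesting change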

    2.1.2 Missing values

    Another important issue, closely tied to the choice of the probability distribution function, is the treatment of missing values. These are cells that are empty in subcube C_a but not in C_b, and vice versa. For example, for the cubes in Figure 2 there were products that were sold in 1990 but were discontinued in 1991. In most datasets we considered, missing values occurred in abundance, hence it is important to handle them well.

    One option is to treat a missing value like a very small constant close to zero. The trick is to choose a good value of the constant. Choosing too small a replacement for the missing values in C_a will imply a large ratio, and even small-magnitude numbers will be reported as interesting. Also, rows with missing values in C_a and different values v_b in C_b will have different ratios. For instance, for a product newly introduced in 1991, all platforms will have a missing value in 1990. By replacing that value with the same constant we will get different ratios for different platforms and thus cannot easily summarize them into a single row.

    A better solution is to replace all missing values with a constant fraction F of the other value, i.e., if v_a is missing then replace v_a with v_b/F. The main issue then is picking the right value of F. Too large a value will cause even small values of v_b to appear surprising, as shown in the example below for N = 3. When F = 10000 the answer picked is:

        (All)-           (All)-                     749    719
        Western Europe   (All)-                            3
        Western Europe   Multiuser UNIX             96     5

    The second row in the answer (where v_a is missing and v_b is 3) is clearly an inferior choice compared to the row below:

        United States    Multiuser Mainframe IBM    60     12

    We handle missing values in a special manner. Whenever we are summarizing rows all of which have missing values of v_a, we assume an ideal summarization and assign a cost of 0 due to error. Otherwise, we assume a suitably small value of v_a depending on the data distribution. For the Poisson distribution we use the Laplace correction [Lap95] and assign v_a = 1/(v_b + 1). For the normal distribution we let v_a = 0 but use a variance of v_b/2.
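    A sketch of that special-case handling (our reading of the rules above; the function names are invented):

        def substitute_missing_v_a(v_b, distribution="poisson"):
            """Replacement used when v_a is missing, per the rules described above."""
            if distribution == "poisson":
                return 1.0 / (v_b + 1)   # Laplace-style correction
            return 0.0                   # normal model: center at 0 ...

        def variance_for_missing(v_b):
            return v_b / 2.0             # ... but with a variance of v_b / 2

        def summarization_cost(rows, per_row_cost):
            """rows: list of (v_a, v_b) pairs with v_a possibly None.  If every v_a is missing,
            the group summarizes ideally at zero cost; otherwise sum the per-row costs."""
            if all(v_a is None for v_a, _ in rows):
                return 0.0
            return sum(per_row_cost(v_a, v_b) for v_a, v_b in rows)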

    2.2 Alternative formulations

    We considered a few other alternatives in addition to the detail-N formulation discussed earlier and the information-theoretic approach that we finally used.

    The first alternative was based on the same structure of the answer but used a different objective function. It was formulated as a bicriteria optimization problem by associating with each aggregated row two quantities: a contribution term c that indicated what percent of the observed increase or decrease is explained by that row, and a sum-of-squares error term e that measured how much the ratios of its excluded children tuples differed from its own ratio. The goal then was to find an answer with high total contribution and low total error. This alternative turned out to be unsatisfactory for several reasons. First, it was hard to formalize the goal. One way was to let the user place a bound on the error and choose an answer that maximizes total contribution within that error bound, but then it was unclear how the user would specify a good error bound. Second, choosing a row based on high contribution did not allow compact summarization of the form "except for a single tuple t, revenue for all others dropped uniformly by x%", because tuple t actually has a negative contribution and its role in the answer is to correct the error of its parent.

    The second alternative we considered used a different structure for the answer. Instead of a flat set of rows, it output a tiny decision tree where tuples with similar ratios were grouped within the same node of the tree. The problem with this approach is that the tree attempts to describe the entire data rather than pick out key summaries as we do in the present approach. This causes the description length to increase.

    3 Algorithm

    Finding an efficient algorithm for this problem is challenging because we wish to use the diff operator interactively, and it is impossible to exhaustively solve in advance for all possible pairs of cells that the user might ask about. Also, to allow easy integration with existing OLAP engines, we did not want to use special-purpose storage structures or indices.

    3.1 Greedy algorithm

    The first-cut greedy algorithm that we designed was as follows:

        Compute all possible aggregates on the concatenation of C_a and C_b, i.e., cube(C_a|C_b).
        Initialize A with the top-most row of cube(C_a|C_b).
            (This forms the first row of A as shown in Figure 3.)
        For i = 2 to N:
            add to A the row from cube(C_a|C_b) which leads to the
            greatest reduction in the number of bits.
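    A compact sketch of that greedy loop (our code; cost(...) stands for the total bit cost of Section 2.1 and is not spelled out here):

        def greedy_answer(candidate_rows, N, cost):
            """candidate_rows: rows of cube(C_a|C_b), with the top-most (All) row first;
            cost(A) is the total bit cost of transmitting C_b given C_a and the answer A."""
            answer = [candidate_rows[0]]          # start with the top-most row
            for _ in range(2, N + 1):
                best = min((row for row in candidate_rows if row not in answer),
                           key=lambda row: cost(answer + [row]))
                answer.append(best)               # each iteration re-scans all candidates
            return answer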

    There are two main problems with the above greedy algorithm. First, it requires as many passes over the data as the answer size N. Second, there are some common cases where it fails to find the optimal answer. We illustrate two such cases.

    Consider the table in Figure 8 showing the revenues for different product categories for the data in Figure 1. We want to find the reasons for the revenue jumping from 9.2 million in 1992 to about 279 million in 1994 for the category System mgmt. Suppose N = 3 for this example. The answer returned by the greedy algorithm is shown in Figure 10. According to the top-most row of the answer, all Products except Storage Management and Automated Operations have a ratio of 16.3 between their values in 1992 and 1994. In Figure 9 we show the data at the next level of detail by drilling down to the Product dimension. We notice that for most products v_a is missing. Hence, the ratio of 16.3 introduces error for most Products, like Security Management, for which the ratio is almost infinity. A more forward-looking algorithm could single out the only two products, Performance Management and Problem Management, for which the ratio is finite, with the result that the final ratio of the top row would be infinite, as shown in Figure 11. With this solution, the total error is close to zero according to our model for handling missing values (Section 2.1.2). The greedy algorithm fails because it cannot predict the final ratio of the top-most row after removing the contribution of its divergent children tuples that finally get included in the answer.

    Platform (All)    Geography (All)    Product (All)

    Sum of Revenue    Year
    Category          1992      1993       1994
    Develop. tools    912.100   1287.070   1793.076
    Info. tools       5.807     6.705      33.420
    Office Apps       38.405    53.991     97.821
    System mgmt.      9.233                278.612

    Figure 8: Comparing revenues in 1992 and 1994 for Product Category System mgmt.

    Category System mgmt.    Platform (All)    Geography (All)

    Sum of Revenue                  Year
    Product                         1992    1994
    Automated Operations                    59.108
    Change & Configuration Mgmt.            27.905
    Performance Management          5.515   66.565
    Problem Management              3.718   35.897
    Resource Accounting                     11.421
    Security Management                     18.265
    Storage Management                      59.451

    Figure 9: Details along different Products of the System mgmt. category.

    PRODUCT               PLATFORM  GEOGRAPHY  YEAR_1992  YEAR_1994  RATIO
    (All)-                (All)     (All)      9.8        160.1      16.3
    Storage Management    (All)     (All)                 59.5
    Automated Operations  (All)     (All)                 59.1

    Figure 10: Answer returned by the greedy algorithm for the query in Figure 8.

    PRODUCT                 PLATFORM  GEOGRAPHY  YEAR_1992  YEAR_1994  RATIO
    (All)-                  (All)     (All)                 176.2
    Performance Management  (All)     (All)      5.5        66.6       12.1
    Problem Management      (All)     (All)      3.7        35.9       9.7

    Figure 11: A better solution to the difference query in Figure 8.

    This illustrates a failure of the greedy algorithm even in the simple case of a single level of aggregation.

    The next case illustrates the failure of the greedy algorithm even when the final ratio can somehow be magically predicted. This example is on a synthetic dataset with a two-level hierarchy on a single dimension, as shown in Figure 12. The top-most member m has optimal ratio 1, and the next intermediate level has three members m1, m2, m3 with ratios 2, 1, 1 respectively. m1 has one large child m11 with ratio 1 and several smaller children with ratio 2.5. Initially, A consists of just the top-most row with ratio 1. The reduction in the number of bits from adding m1 to A is very small because of cancellations, as follows: the predicted ratio for the children of m1 changes from 1 to 2.0. While this helps the several small members with ratio 2.5, the benefit is canceled because for the large child m11 the error increases. The reduction in error from including any member with ratio 1 is 0. The reduction from including any of the other children of m1 with ratio 2.5 is small, but that is the best reduction possible, so the final answer will include two of these children. However, the optimal answer consists of m1 and its single divergent element m11 with ratio 1. This example indicates that another shortcoming of the greedy algorithm is its top-down approach to processing.

    Figure 12: An example where the greedy algorithm fails.

    These two examples of the failure of the greedy solution should help the reader appreciate the difficult areas of the problem; they led us to the dynamic programming algorithm described next.

    3.2 Dynamic programming algorithm

    The dynamic programming algorithm eliminates the above bad cases and is optimal under some assumptions elaborated next. We present the algorithm in three stages for ease of exposition. First, we discuss the case of a single dimension with no hierarchies. Next, we introduce hierarchies on that dimension, and finally we extend to the general case of multiple dimensions with hierarchies.

    3.2.1 Single dimension with no hierarchies

    In this case the subcube C_b consists of a one-dimensional array of T real values. The basic premise behind the dynamic programming formulation is that we can decompose the set of tuples T into two subsets T' and T - T', find the optimal solution for each of them separately, and combine them to get the final answer for the full set T. Let D(T, n, r) denote the total cost (number of bits) for T tuples, answer size n, and final ratio r of the top-most row in A. Then,

        D(T, n, r) = min_{0 <= m <= n} ( D(T - T', n - m, r) + D(T', m, r) )        (1)

    This property enables us to design a bottom-up algorithm. We scan the tuples in C_a|C_b at the detailed level sequentially while maintaining the best solution for slots from n = 0 to n = N. Assume that we magically know the best value of r. Let T_i denote the first i tuples scanned so far. When a new (i+1)-th tuple t_{i+1} is scanned, we update the solution for all N+1 slots using the equation above with T' = t_{i+1}. Thus, the solution for slot n over the first i+1 tuples is updated as

        D(T_{i+1}, n, r) = min( D(T_i, n-1, r) + D(t_{i+1}, 1, r),
                                D(T_i, n, r)   + D(t_{i+1}, 0, r) )

    By the cost function of Section 2.1, D(t_{i+1}, 1, r) = 0 and D(t_{i+1}, 0, r) = -log Pr[v_b | r * v_a].
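    A sketch of that per-tuple slot update for one fixed ratio guess (our code; cost_of_tuple stands for -log Pr[v_b | r * v_a]):

        def scan_update(slots, cost_of_tuple):
            """slots[n] = best cost D(T_i, n, r) using at most n answer rows; returns D(T_{i+1}, ., r).
            cost_of_tuple is the bit cost of the new tuple when it is *not* listed in the answer."""
            new_slots = list(slots)
            for n in range(len(slots) - 1, -1, -1):
                keep_out = slots[n] + cost_of_tuple               # tuple summarized by its parent ratio
                put_in = slots[n - 1] if n > 0 else float("inf")  # tuple listed explicitly: adds cost 0
                new_slots[n] = min(keep_out, put_in)
            return new_slots

        # Example: N = 2 slots, previous bests [5.0, 3.0, 2.0], new tuple costs 4.0 bits if summarized.
        print(scan_update([5.0, 3.0, 2.0], 4.0))  # -> [9.0, 5.0, 3.0]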

    The best value of r is found by simultaneously solving for different guesses of r and refining those guesses as we progress. At the start we know (as part of the query parameters) the global ratio r_g obtained when none of the tuples is included in the answer. We start with a fixed number R of ratios around this value, from r_g/2 to 2*r_g. We maintain a histogram of r values that is updated as tuples arrive. Periodically, from the histogram we pick R-1 different r values by dropping up to N extreme values and using the average over the middle r values. We then select the R most distinct values from the R previous values and the R-1 new values and use those to update the guesses. When the algorithm ends, we pick the solution corresponding to the value of r which has the smallest cost. Thus, the final solution is:

        D(T, N) = min_{1 <= i <= R} D(T, N, r_i)

    3.2.2 Single dimension with hierarchies

    We now generalize to the case where there is an L-level hierarchy on the dimension. For any intermediate hierarchy level, we have the option of including in the answer the aggregated tuple at that level. When we include an aggregate tuple agg(T), the ratio of all its children not in the answer is equal to the ratio induced by agg(T) instead of the outer global ratio r. The updated cost equation is thus:

        D(T, N, r) = min( cost excluding the aggregate tuple, cost including the aggregate tuple )
                   = min( D(T, N, r) from equation (1),
                          min_{r'} ( D(T, N-1, r') + agg(T) ) )        (2)

    Figure 13: State of the dynamic programming algorithm for N = 2, L = 2 and R = 1. (Each of the two hierarchy levels keeps partial solutions for slots N = 0, 1, 2; tuples in the detailed data arrive grouped by common parent.)

    The algorithm maintains L different nodes, one for each of the L levels of the hierarchy. Each node stores partial solutions for the N+1 slots as described for the case of a single hierarchy. In Figure 13 we show an example state where the answer size N = 2 and the number of levels of the hierarchy L = 2. The tuples in C_a|C_b are then scanned in an order such that all tuples within the same hierarchy appear together. The tuples are first passed to the bottom-most (most detailed) node of the hierarchy. This node updates its solution using Equation 1 until it gets a tuple that is not a member of the current hierarchy. When that happens, it finds the final best solution using Equation 2, passes the solution to the node above it, and re-initializes its state for the next member of the hierarchy. A non-leaf node, on getting a new solution from the node below it, updates its solution using the same procedure. Thus, data is pipelined from one level to the next up the hierarchy. At each level we find the best solution for the group of tuples that have the same parent at that level of the hierarchy. The final answer is stored in the top-most node after all the tuples are scanned.
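    A structural sketch of that pipeline (ours; the combine and close_group callbacks stand for Equations 1 and 2, the mapping of a group key to the coarser level's member is elided, and the final end-of-scan flush is omitted):

        class LevelNode:
            """One node per hierarchy level, holding partial solutions for slots 0..N."""
            def __init__(self, N, parent=None):
                self.N = N
                self.parent = parent          # next node up the hierarchy, or None at the top
                self.current_key = None       # parent member of the group being accumulated
                self.slots = [0.0] * (N + 1)  # best costs for one ratio guess

            def accept(self, key, slot_costs, combine, close_group):
                """key: the group (parent member) of the incoming tuple or solution;
                slot_costs: its per-slot costs; combine/close_group apply Eq. 1 and Eq. 2."""
                if self.current_key is not None and key != self.current_key:
                    finished = close_group(self.slots)          # Eq. 2: optionally include agg tuple
                    if self.parent is not None:
                        # (mapping self.current_key to the coarser level's member is elided here)
                        self.parent.accept(self.current_key, finished, combine, close_group)
                    self.slots = [0.0] * (self.N + 1)           # re-initialize for the next group
                self.current_key = key
                self.slots = combine(self.slots, slot_costs)    # Eq. 1: merge in the new solution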

    Next we discuss how to choose values for the ratios of internal nodes. We need to allow for the possibility that a node's aggregate tuple may not be included; instead, any of its parents up to the root could become its immediate parent in the final answer. Hence, we need to solve for the ratios of those parent tuples. Lacking any further information, we start with a fixed set of ratios around the global ratio as in the single-hierarchy case. As more tuples arrive at a node, we bootstrap towards a good choice of ratio as follows: each node propagates downwards to its children nodes a current best guess of the ratio of its aggregate tuple. For tuples that have already passed through, nothing can be done except re-evaluate costs with the new ratios, but for subsequent tuples we get progressively better estimates.

    Lemma 3.1 The above dynamic programming algorithm generates the optimal answer if the ratios guessed at each level are correct.

    Proof. The proof of optimality follows directly from Equation 2, since the algorithm is a straight implementation of that equation with different ways of decomposing the tuples into subsets. The optimality of a dynamic programming algorithm is not affected by how and into how many parts we decompose the tuples, as long as we can find the optimal solution in each part separately and combine them optimally. For L = 1 the decomposition is such that the new tuple becomes the T' in the equation and the size of T' is always 1. For L > 1, we decompose tuples based on the boundaries of the hierarchy, and T' consists of all children of the most recent value at that level of the hierarchy.

    3.2.3 Multiple dimensions

    In the general case we have multiple dimensions with one or more levels of hierarchy on each of the dimensions. We extend to multiple dimensions by pre-ordering the levels of the dimensions and applying the solution of Section 3.2.1. The goal is to pick an ordering that will minimize the total number of bits. Intuitively, a good order is one where levels that show more similarity (in terms of ratio) are aggregated earlier. For example, for the queries in Figure 2 and Figure 4 on the data shown in Figure 1, we found the different levels of the Platform dimension to be more similar than the different levels of the Product dimension. Therefore, we aggregated on Platform first. We formalize this intuition by framing it in the same information-theoretic terms. For each level of each dimension, we calculate the total number of bits needed if all tuples at that level were summarized to their parent tuples at the next level. For each level l of dimension d we first aggregate the values to that level. Let T_ld denote the set of aggregated tuples at that level. We next estimate the error incurred if each tuple t in T_ld is approximated by the ratio induced by its parent tuple at level l-1, as follows:

        B_ld = sum_{t in T_ld} D(t, 0, r_parent(t))

    Figure 14: Our prototype. An Excel frontend on the client side invokes, via ODBC, a stored procedure for computing reasons that resides on the server side next to the OLAP server; the SQL/MDX query, a cursor over the query result, and the N-row answer flow between the components.

    Finally, we sort the levels of the dimensions on B_ld, with the smallest value corresponding to the lower-most level. Of course, in the final order the more detailed levels have to be below the less detailed levels of the same hierarchy, irrespective of the order specified by B_ld. The intuition behind this form of ordering is that if the tuples within a level are similar, the parent tuple will be a good summary for those tuples, and the number of bits needed to transmit those tuples to a receiver who already knows the parent tuple will be small. In contrast, if the children of a tuple are very divergent, a larger number of bits will be needed to transmit them.
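    A sketch of that ordering step (our code; B[(d, l)] is assumed to hold the B_ld estimates computed above, with levels numbered from coarse to detailed within each dimension):

        def order_levels(B):
            """B: dict mapping (dimension, level) -> B_ld.  Returns levels ordered bottom-up:
            small B_ld (easily summarized) goes lower, but a detailed level is never placed
            above a coarser level of the same dimension."""
            order = sorted(B, key=lambda dl: B[dl])                  # smallest B_ld first (lowest)
            for d in {dim for dim, _ in B}:                          # repair within each dimension
                fixed = iter(sorted((dl for dl in order if dl[0] == d), key=lambda dl: -dl[1]))
                order = [next(fixed) if dl[0] == d else dl for dl in order]
            return order

        # Hypothetical example: Platform and Product, each with 2 levels (2 = more detailed).
        B = {("Platform", 1): 10.0, ("Platform", 2): 4.0, ("Product", 1): 30.0, ("Product", 2): 20.0}
        print(order_levels(B))
        # -> [('Platform', 2), ('Platform', 1), ('Product', 2), ('Product', 1)]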

    4 Implementation

    4.1 Integrating with OLAP products

    Our current prototype is based on IBM's DB2/UDB database (version 5.2), which we view as a ROLAP engine. We use SQL to access the data such that partial processing is done inside the DBMS. In Figure 14 we show the architecture of our prototype. Our algorithm is packaged as a stored procedure that resides on the server side. The client (Microsoft Excel in our case) uses ODBC embedded in Visual Basic for invoking the stored procedure. The N-row answer is stored in the database and accessed by the client using ODBC. The stored procedure generates a giant SQL statement that subsets only the relevant part of the data cube. The query involves selecting the specified values at the specified aggregation level and sorting the data such that the stored procedure does not have to do any intermediate caching of the results. In Figure 15 we show the example SQL statement for the difference query in Figure 2. The data is assumed to be laid out in a star schema [CD97].

    The stored procedure is a rather light-weight attachment to the main server. The indexing and query processing capability of the server is used to do most of the heavy-weight processing. Thus, the stored procedure itself does not use any extra disk space, and the amount of memory it consumes is independent of the number of rows. It is O(NLR), where N denotes the maximum answer size, L denotes the number of levels of hierarchy, and R denotes the number of distinct ratio values tried by the algorithm.

    with subset(productId, platformId, year, revenue) as
      (select productId, platformId, year, revenue
       from cube
       where geographyId = (select geographyId from geography
                            where name = 'Rest of World')),
    cube_A(productId, platformId, v_a, v_b) as
      (select productId, platformId, revenue, 0 from subset where year = 1990),
    cube_B(productId, platformId, v_a, v_b) as
      (select productId, platformId, 0, revenue from subset where year = 1991)
    select prod_groupId, prod_categoryId, productId, plat_userId, plat_typeId,
           platformId, sum(v_a), sum(v_b)
    from ((select * from cube_A) union all (select * from cube_B))
    group by prod_groupId, prod_categoryId, productId, plat_userId, plat_typeId, platformId
    order by prod_groupId, prod_categoryId, productId, plat_userId, plat_typeId, platformId

    Figure 15: SQL query submitted to the OLAP server.

    This architecture is thus highly scalable in terms of resource requirements. We are also building a second prototype using the emerging OLE DB for OLAP [Mic98b] standard for integration with any OLAP engine that supports this API.

    4.2 Performance evaluation

    In this section we present an experimental evaluation on our prototype to demonstrate (1) the feasibility of getting interactive answers on typical OLAP systems and (2) the scalability of our algorithm as the size and dimensionality of the cube increase.

    All the experiments were done on an IBM PC with a 333 MHz Intel processor and 128 MB of memory, running Windows NT 4.0. We used the following datasets.

    Software revenue data: This is a small dataset, but it is interesting because it is real-life data about the revenues of different software products from 1990 to 1994. We discussed this dataset earlier in Section 1.2 and repeat the schema here for convenience. The numbers within brackets denote the cardinality of that level.

    Product              Platform             Geography       Year
    Product name (67)    Platform name (43)   Geography (4)   Year (5)
    Prod_Category (14)   Plat_Type (6)
    Prod_Group (3)       Plat_User (2)

    OLAP Council benchmark [Cou]: This dataset was designed by the OLAP Council to serve as a benchmark for comparing the performance of different OLAP products. It has 1.36 million total non-zero entries and four dimensions: Product with a seven-level hierarchy, Customer with a three-level hierarchy, Channel with no hierarchy, and Time with a four-level hierarchy, as shown below.

    Product         Customer        Channel       Time
    Code (9000)     Store (900)     Channel (9)   Month (17)
    Class (900)     Retailer (90)                 Quarter (7)
    Group (90)                                    Year (2)
    Family (20)
    Line (7)
    Division (2)

    Grocery sales data: This is a demo dataset obtained from the Microsoft DSS product [Mic98a]. It has 250 thousand total non-zero entries and consists of five dimensions with hierarchies as shown below.

    Store         Customer       Product             Promotion        Time
    Name (24)     City (109)     Name (1560)         Media type (14)  Month (24)
    State (10)    State (13)     Subcategory (102)                    Quarter (8)
    Country (3)   Country (2)    Category (45)                        Year (2)
                                 Department (22)
                                 Family (3)

    We now establish the computational feasibility of answering such why queries online. Unlike conventional data mining algorithms, we intend this tool to be used in an interactive manner, hence the processing time for each query should be bounded. In most cases, although the entire cube can be very large, the subset of the cube actually involved in the processing is rather small. When that is not true, we bound the processing time by limiting the level of detail from which we start. When the server is first initialized, it collects statistics on the number of tuples at various levels of detail and uses them to determine the level of detail from which the processing should start.

    The queries for our experiments are generated byrandomly selecting two cells from different levels of ag-gregation of the cube. There are three main attributesof the workload that affect processing time:

    The number of tuples in the query result (the size of C_a|C_b).

    The total number of levels in C_a|C_b, which determines the number of nodes (L) in the dynamic programming algorithm.

    The answer size N, which determines the number of slots per node. The default value of N in our experiments was 10.

    Figure 16: Total time taken as a function of subcube size (time in seconds versus the number of tuples in the subcube, for the algorithm time, the data access time, and their total).


    We report on experimental results with varying values of each of these three parameters in the next three sections.

    4.2.1 Number of tuples

    We chose ten arbitrary queries within two levels of aggregation from the top. In Figure 16 we plot the total time in seconds for each query, sorted by the number of non-zero tuples in the subcube (C_a|C_b).

    We show three curves. The first denotes the data access time, which includes the time from the issue of the query to returning the relevant tuples to the stored procedure. The second curve denotes the time spent in the stored procedure for finding the answer. The third curve denotes the sum of these two times. From Figure 16 we can make the following observations:

    The total time taken for finding the answer is less than a minute in most cases. Even for the subcube with a quarter million entries the total time is only slightly over a minute.

    Only a small fraction (< 20%) of the total time is spent in the stored procedure that implements the diff operator. The majority of the time is spent in subsetting the relevant data from the database and passing it to the stored procedure. This implies that if we used a server better optimized for answering OLAP queries, the processing time could be reduced even further.

    The subset of the data actually relevant to the query is often very small even for fairly large datasets. For example, the OLAP benchmark dataset has 1.37 million total tuples, but the largest subcube involving two comparisons along the month dimension is only 81 thousand tuples, for which the total processing time is only 8 seconds.

    Figure 17: Total time taken versus the number of levels in the answer (time in seconds against the number of levels).


    4.2.2 Number of levels

    In Figure 17 we show the processing time as a function of the number of levels of aggregation. As we increase the number of levels for a fixed total number of tuples, we expect the processing time to increase, although at a slower than linear rate, because significantly more work is done at the lower levels than at the higher levels of the hierarchy. The exact complexity also depends on the fanout of each node of the hierarchy. In Figure 17 we observe that as the number of levels is increased from 2 to 7, the processing time increases from 6 to 9 seconds for a fixed query size of 70 thousand tuples. This is a small increase compared to the total data access time of 40 seconds.

    4.2.3 Result size N

    In Figure 18 we show the processing time as a function of the answer size N for two different queries: query 1 with 8 thousand tuples and query 2 with 15 thousand tuples. As the value of N is increased from 10 to 100, the processing time increases from 2.7 to 6 seconds for query 1 and from 3.1 to 9 seconds for query 2. When we add the data access time to the total processing time, this amounts to less than a factor of 2 increase in total time. The dynamic programming algorithm has an O(N^2) dependence on N, but other fixed overheads dominate the total processing time more than the core algorithm. That explains why, even when we increase N from 1 to 100, the processing time increases by less than a factor of 5.

    Figure 18: Total time taken as a function of result size (N) (time in seconds against N, for Query 1 and Query 2).


    5 Conclusion

    In this paper we introduced a new operator for enhancing existing multidimensional OLAP products with more automated tools for analysis. The new operator allows a user to obtain in one step summarized reasons for changes observed at an aggregated level.

    Our formulation of the operator allows key reasons to be conveyed to the user using a very small number of rows that he or she can quickly assimilate. By casting the problem in information-theoretic terms, we obtained a clean objective function that could be optimized for getting the best answer. We designed a dynamic programming algorithm that optimizes this function using a single pass of the data while consuming very little memory. This algorithm is significantly better than our initial greedy algorithm both in terms of performance and quality of answer.

    We prototyped the operator on the DB2/OLAP server using an Excel frontend. Our design enables most of the heavy-weight processing, involving index lookups and sorts, to be pushed inside the OLAP server. The extension code needed to support the operator does a relatively smaller amount of work. Our design goal was to enable interactive use of the new operator, and we demonstrated this through experiments on the prototype. Experiments using the industry OLAP benchmark indicate that even when the subcubes defined by the diff query include a quarter million tuples, we can process them in about a minute. Our experiments also show the scalability of our algorithm as the number of tuples, the number of levels of hierarchy, and the answer size increase.

    In the future we wish to design more operators of this nature so as to automate more of the existing tedious and error-prone manual discovery process in multidimensional data.

    References

    [Arb]    Arbor Software Corporation, Sunnyvale, CA. Multidimensional Analysis: Converting Corporate Data into Strategic Information. http://www.arborsoft.com.

    [CD97]   S. Chaudhuri and U. Dayal. An overview of data warehouse and OLAP technology. ACM SIGMOD Record, March 1997.

    [Cod93]  E. F. Codd. Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate. Technical report, E. F. Codd and Associates, 1993.

    [Cor97a] Cognos Software Corporation. PowerPlay 5, special edition. http://www.cognos.com/powercubes/index.html, 1997.

    [Cor97b] International Data Corporation. http://www.idc.com, 1997.

    [Cou]    The OLAP Council. The OLAP benchmark. http://www.olapcouncil.org.

    [CT91]   Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley and Sons, Inc., 1991.

    [Dis]    Information Discovery. http://www.datamine.inter.net/.

    [Ham77]  M. Hamburg. Statistical Analysis for Decision Making. Harcourt Brace Jovanovich, Inc., New York, 1977.

    [HF95]   J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proc. of the 21st Int'l Conference on Very Large Databases, Zurich, Switzerland, September 1995.

    [Lap95]  P.-S. Laplace. Philosophical Essays on Probabilities. Springer-Verlag, New York, 1995. Translated by A. I. Dale from the 5th French edition of 1825.

    [Mic98a] Microsoft Corporation. Microsoft Decision Support Services version 1.0, 1998.

    [Mic98b] Microsoft Corporation. OLE DB for OLAP version 1.0 specification. http://www.microsoft.com/data/oledb/olap/spec/, 1998.

    [SAM98]  Sunita Sarawagi, Rakesh Agrawal, and Nimrod Megiddo. Discovery-driven exploration of OLAP data cubes. In Proc. of the 6th Int'l Conference on Extending Database Technology (EDBT), Valencia, Spain, 1998. Expanded version available from http://www.almaden.ibm.com/cs/quest.

    [Sof]    Pilot Software. Decision support suite. http://www.pilotsw.com.
