
International Conference on Industrial Engineering and Systems Management

IESM 2007, May 30 - June 2

BEIJING - CHINA

Explanation of exceptional values in multidimensional business databases

With a case study on the analysis of vehicle criminality data

E.A.M. Caron, A. Veenstra

Erasmus University Rotterdam, Erasmus Research Institute of Management, P.O. Box 1738, 3000 DR, Rotterdam, The Netherlands

Abstract

In this paper, we describe an extension of the OnLine Analytical Processing (OLAP) framework with causal explanation, offering the possibility to automatically generate explanations for exceptional cell values. This functionality can be built into conventional OLAP databases using a generic explanation formalism, which supports the work of managers in diagnostic processes. The central goal is the identification of specific knowledge structures and reasoning methods required to construct computerized explanations from multi-dimensional data. The extended methodology was tested on a case study that involves the analysis of an OLAP data set on stolen vehicles in The Netherlands. The findings suggest that improved decision-making by managers is possible because automated problem identification and explanation generation enhance the current tedious and error-prone manual analysis process. It is also noted that this novel methodology has general utility for decision support systems.

Key words: Decision support systems, OLAP, Multidimensional databases, Explanation, Data mining

1 Introduction

Today’s OLAP systems have limited explanation or diagnostic capabilities. The diagnostic process is now carried out mainly manually by business analysts, where the analyst explores the multidimensional data to spot exceptions visually, and navigates the data with operators like drill-down, roll-up, and selection to find the reasons for these exceptions. Human analysis can become problematic and error-prone for large data sets. The main objective of our current research is to largely automate these manual diagnostic discovery processes [1]. This functionality can be provided by extending the conventional OLAP system with an explanation formalism, which supports the work of human decision makers in diagnostic processes. Here diagnosis is defined as finding the best explanation of unexpected behaviour (i.e., symptoms) of a system under study [18]. This definition captures the two tasks that are central in problem diagnosis, namely problem identification and explanation generation. The rationale behind this paper is to extend the methodology for automated diagnosis as described in [1–3] and its applications. Firstly, we extend the explanation methodology with a procedure to deal with so-called cancelling-out or neutralisation effects in data sets. A neutralisation effect is the phenomenon that the effects of two or more lower-level variables may cancel each other out in a system of equations, so that their (joint) influence on a higher-level variable is partially or fully neutralized. If one starts diagnosis with the method described in [3],

1 This paper was not presented at any other revue. Corresponding author E.A.M. Caron

Email addresses: [email protected] (E.A.M. Caron), [email protected] (A. Veenstra).


cancelling-out effects are not identified. However, these effects are quite common in business data sets and could lead to results in the form of incomplete explanation trees. Secondly, we introduce the extended methodology for diagnosis in the application domain of multidimensional databases.

OLAP systems are a popular business intelligence technique in the field of enterprise information systems for decision support. OLAP implementations typically employ a star schema, which stores data de-normalized in fact tables and dimension tables. The fact table contains mappings to each dimension table, along with the actual measured data. A star model representing a multidimensional database with vehicle criminality data is shown in Fig. 2. In a star schema, data is organized using the dimensional modeling approach, which classifies data into measures and dimensions. Measures such as, for example, sales, profit, and cost figures are the basic units of interest for analysis. Dimensions correspond to different perspectives for viewing measures. Example dimensions are a product or a time dimension. Dimensions are usually organized as dimension hierarchies, which offer the possibility to view measures at different dimension levels. The central fact table of the example star model in Fig. 2, the table “Vehicle crime facts”, lists the measures of the data set. The vehicle criminality data set has five dimensions (Date vehicle stolen, Vehicle classification, etc.), and all dimensions have at least a three-level hierarchy. Aggregating measures up to a certain dimension level creates a multidimensional view of the data, also known as the data cube. Fig. 1 shows an example data cube representing a multidimensional database with vehicle criminality data.

Fig. 1. Example cube from a database with vehicle criminality data

Our discussion on diagnostic reasoning is largely based on Feelders and Daniels’ notion of explanations in [2,3], which is essentially based on Humphreys’ notion of aleatory explanations [9] and the theory of explaining differences by Hesslow [7]. Causal influences can appear in two forms: contributing and counteracting. The canonical format for causal explanations is taken from [2,3]:

〈a, F, r〉 because C+, despite C−, (1)

where 〈a, F, r〉 is the event to be explained, C+ is a non-empty set of contributing causes, and C− a (possibly empty) set of counteracting causes. The explanation itself consists of the causes to which C+ jointly refers. C− is not part of the explanation, but gives a clearer notion of how the members of C+ actually brought about the symptom. In words, the explanandum is a three-place relation between an object a (e.g. the ABC-company) that shows the actual behaviour of a company, a property F (e.g. having a low profit) that shows the deviation for a particular variable from its norm value, and a reference class r (e.g. other companies in the same branch or industry) that shows the norm behaviour. The task is not to explain why a has property F, but rather to explain why a has property F when the other members of r do not. For example, when r is selected as the statistically normal, the explanatory cause must be abnormal. This general formalism for explanation constitutes the basis of the framework for diagnosis in the OLAP context.

Important related research in the domain of multidimensional databases is the work by Sarawagi et al. on problem identification [14] and explanation generation [15]. In [14] the authors developed a discovery-driven exploration paradigm that mines the multidimensional data for exceptions and summarizes the exceptions at appropriate levels in advance by applying a log-linear statistical model. Our method for problem identification is quite similar; however, our approach is based on the multi-way ANOVA model. In [15] Sarawagi presented an explanation operator for multidimensional data that lets the analyst get summarized reasons for drops or increases observed at an aggregated level. This operator is not based on a causal model of explanation, resulting in problems with finding clear parameters for its algorithms. Moreover, norm values in this approach are not pre-computed by a statistical model but are typically historical norm values.

The remainder of this paper is organized as follows. Section 2 introduces our notation for the multidimensional model, followed by a description of normative models appropriate for OLAP problem identification in section 3. In section 4 the explanation formalism is extended for multidimensional data in order to automatically generate explanations, and in section 5 the complete method is illustrated in a case study on the analysis of vehicle criminality data. Finally, conclusions are discussed in section 6.

2 Notation and equations

Many different notations and definitions of multidimensional data schemata can be found in the literature [11,17]. Here we introduce a generic notation that is particularly suitable for combining the concepts of measures, dimensions, and dimension hierarchies. Therefore, we define a measure $y$ as a function on multiple domains:

$$y^{i_1 i_2 \ldots i_n} : D_1^{i_1} \times D_2^{i_2} \times \ldots \times D_n^{i_n} \rightarrow \mathbb{R}. \qquad (2)$$

Each domain $D_k$ has a number of hierarchy levels ordered by $D_k^{i_{\max}} \prec D_k^{i_{\max}-1} \prec \ldots \prec D_k^{0}$, where $D_k^{0}$ is the highest level and $D_k^{i_{\max}}$ is the lowest level in $D_k$. The top level of a dimension has a single level instance $D_k^{0} = \{\text{All}\}$. In the case study database we have the following hierarchy for the Time dimension: $T^3 \prec T^2 \prec T^1 \prec T^0$, where $T^0 = \{\text{All-Times}\}$, $T^1 = \{2000, 2001\}$, $T^2 = \{\text{Q1}, \text{Q2}, \text{Q3}, \text{Q4}\}$, and $T^3 = \{\text{Jan}, \text{Feb}, \ldots, \text{Dec}\}$. Sometimes we will write, for example, T[Quarter] for $T^2$. A cell is denoted by $(d_1, d_2, \ldots, d_n)$, where the $d_i$’s are elements of the domain hierarchy at some level, so for example (1999,1???,Passenger cars) is a cell in the data cube shown in Fig. 1. Each cell contains data, which are the values of the measure $y$ like, for example, # stolen vehicles$^{121}$(1999,1???,Passenger cars). The measure’s upper indices indicate the level on the associated dimension hierarchies. For example, # stolen vehicles$^{121}$ is a measure on dimension levels $T^1$ (= T[Year]), $L^2$ (= L[Postal code pos 1]), and $V^1$ (= V[Vehicle category]). If no confusion can arise we will leave out the upper indices and write # stolen vehicles(1999,1???,Passenger cars). Furthermore, the combination of a cell and a measure is called a data point. The measure values at the lowest level cells are entries of the base cube. If a measure value is on the base cube level, then the hierarchies of the domains can be used to aggregate the measure values by the usual aggregation operators like SUM(), COUNT(), or AVG(). By applying suitable equations, we can alter the level of detail and map low level cubes to high level cubes and vice versa. For example, aggregating measure values along the dimension hierarchy (roll-up) creates a multi-dimensional view on the data, and de-aggregating the measures on the data cube to a lower dimension level (drill-down) creates a more specific cube. The slice operation performs a selection on one dimension of a given cube, resulting in a new (sub) cube. The reverse of a slice (un-slice) operation performs a “de-selection” on one or more dimensions, resulting in a more general cube. This general cube is defined as the context of a cell.

Here we investigate the common situation where the aggregation operator is the summation of measures in the dimension hierarchy. So $y$ is an additive measure [10] if in each dimension and hierarchy level of the data cube:

$$y^{i_1 \ldots i_q - 1 \ldots i_n}(\ldots, a, \ldots) = \sum_{j=1}^{J} y^{i_1 \ldots i_q \ldots i_n}(\ldots, a_j, \ldots) \qquad (3)$$

where $a \in D_i^{q-1}$, $a_j \in D_i^{q}$, $q$ is some level in the dimension hierarchy, and $J$ represents the number of level instances in $D_i^{q}$. For example,

$$\#\,\text{stolen vehicles}^{121}(2001, \text{Postal code pos 1}, \text{Heavy trucks}) = \sum_{j=1}^{4} \sum_{k=1}^{10} \#\,\text{stolen vehicles}^{231}(2001.\text{Q}_j, \text{Postal code pos 2}_k, \text{Heavy trucks})$$


is the equation for two roll-ups. If there is no confusion about the level in the dimension hierarchy we will use $y(+,+,\ldots)$ for the left-hand side (LHS) of the above equation. In this way a data cube with only two dimensions is represented by a table where the row totals are given by $y(d_1,+)$, column totals are given by $y(+,d_2)$, and the grand total is given by $y(+,+)$.
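As an illustration of the additive roll-up in (3), the following minimal sketch aggregates a small two-dimensional base cube. The cube contents, the dimension instances, and the helper name `roll_up` are hypothetical and chosen only for this example; they are not taken from the case study data.

```python
from collections import defaultdict

# Hypothetical base cube: (quarter, postal code position 1) -> # stolen vehicles.
# The counts are illustrative only.
base = {
    ("Q1", "1???"): 12, ("Q1", "2???"): 7, ("Q1", "3???"): 3,
    ("Q2", "1???"): 15, ("Q2", "2???"): 9, ("Q2", "3???"): 4,
    ("Q3", "1???"): 10, ("Q3", "2???"): 6, ("Q3", "3???"): 2,
    ("Q4", "1???"): 13, ("Q4", "2???"): 8, ("Q4", "3???"): 5,
}

def roll_up(cube, axis):
    """Sum an additive measure over one dimension, as in eq. (3).

    The aggregated position of each cell is marked with '+'.
    """
    totals = defaultdict(int)
    for cell, value in cube.items():
        key = cell[:axis] + ("+",) + cell[axis + 1:]
        totals[key] += value
    return dict(totals)

by_quarter = roll_up(base, axis=1)     # y(d1, +): row totals
by_region = roll_up(base, axis=0)      # y(+, d2): column totals
grand = roll_up(by_quarter, axis=0)    # y(+, +): grand total

print(by_quarter)
print(by_region)
print(grand)
```

Applying `roll_up` twice, in either order, gives the same grand total, which is the property that makes the measure additive in the sense used here.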

Furthermore, we assume that a business model M is given representing relations between measures. These relations can be derived from many domains, like finance, accounting, logistics, and so forth. Relations are denoted by:

$$y^{i_1 i_2 \ldots i_n}(d_1, \ldots, d_n) = f(\mathbf{x}^{i_1 i_2 \ldots i_n}(d_1, \ldots, d_n)) \qquad (4)$$

where $\mathbf{x} = (x_1, \ldots, x_n)$ and $y$ are measures defined on the same domains. Business model equations usually hold on equal aggregation levels in the data cube, therefore we may leave out upper indices if no confusion can arise. In the vehicle criminality data cube we have, for example, the following quantitative relations:

# non recovered vehicles = # stolen vehicles − # recovered vehicles and
recovery ratio = # recovered vehicles / # stolen vehicles.

3 Normative models

Diagnosis is preceded by the problem identification phase. This phase is based on Pounds’ [12] conclusion that managers define a “problem” as an unexpected deviation from some standard or expectation. Basically, problem identification is a comparison activity carried out by managers, in which data (i.e. business performance variables such as sales volume, profit, market share, return on investment, etc.) are compared in order to detect discrepancies. A problem (symptom) occurs when a variable is outside a predetermined acceptable range or threshold. If the discrepancy is significant, it is viewed as a symptom that must be explained. This definition is, for example, employed in the widely used technique of variance analysis in accounting systems, so the approach is applicable to a multitude of management problems in a wide variety of organisations. The expected behaviour of the system (e.g. a financial, sales, or accountancy model) is usually defined by goals that have been formulated by management. In other words, the manager has expectations based on some normative model against which reality is measured. It was found that managers use several types of normative models to define their goals. These models could be based on trends (past or projected), comparable situations inside or outside the organization (e.g. industry averages), or plans and budgets. In addition, managers also apply more abstract normative models in the form of mathematical or statistical models, to compute or estimate the expected value of important decision-making variables (e.g. in forecasting future sales). The normative model specifies the reference object r that should be used for comparison.

3.1 Problem identification in OLAP

There are many ways to construct reference objects for multidimensional data. The simplest way is pairwise comparison between two cells [15]. In general, only cells on the same aggregation level are compared, for obvious reasons such as the measurement scale of the variable. For example, we can compare # stolen vehicles(2006,Netherlands,Passenger cars) with the number of stolen cars of the previous year, norm(# stolen vehicles(2005,Netherlands,Passenger cars)), as a historical norm value. Other common norm values are the average ($\bar{y}$) of a cell computed using a context of the cell:

$$\bar{y}(\ldots, +, \ldots) = \frac{1}{J} \sum_{j=1}^{J} y(\ldots, a_j, \ldots), \qquad (5)$$

and for the average over all domains we write $\bar{y}(+,+,\ldots,+)$. Expected values are based on statistical models. A huge variety of statistical models exists for two-way tables, three-way tables, etc.; see Scheffé [16] and Tukey in [8]. Here we only consider two models, namely the additive multi-way ANOVA model for continuous data and the model of independence for discrete category data. For a continuous data set, in the situation of only two dimensions, we can write the expected value as an additive function of three terms obtained from the possible aggregates of the table:

$$\hat{y}(d_1, d_2) = \bar{y}(d_1, +) + \bar{y}(+, d_2) - \bar{y}(+,+). \qquad (6)$$


Here we assume that the joint contribution of the aggregates is the sum of the separate contributions from each aggregate and $y = \hat{y} + \varepsilon$, where $\varepsilon(d_1, d_2) \sim N(0, \sigma^2)$. The residual of a model is defined as $\partial y = y - y^r = y - \hat{y}$. If we normalize the residual of the model by the standard deviation of the cell, we get the normalized residual:

$$s = \frac{\partial y}{\sigma}, \qquad (7)$$

where $\hat{y}$ is computed with the same statistical model applied to a certain context of the cell and σ is the standard deviation in the same context. The problem of looking for exceptional cell values is equivalent to the problem of looking for exceptional normalized residuals. Problem identification in OLAP is the process that computes a value $h(y^a, y^r)$ for each data point, where $h$ is some user-specified function such as percentage difference or absolute difference. The actual data point is $y^a$ and $y^r$ is the reference object. When a statistical model is used as a normative model, $y^r = \hat{y}$. Furthermore, the larger the absolute value of the normalized residual, the more exceptional a cell is. A data point is a symptom or surprise value [14] if the absolute value of $s$ is higher than some user-defined threshold δ. The diagnostic program automatically performs the following series of statistical tests on each variable in the business model to detect symptoms in the data set under consideration:

• if ∂y/σ > δ (one-tailed test) then the cell is labelled ∂y = “high”,
• if ∂y/σ < −δ (one-tailed test) then the cell is labelled ∂y = “low”, and
• if −δ ≤ ∂y/σ ≤ δ then the cell is labelled ∂y = “normal”.

Typically, we select δ = 1.645, corresponding to a probability of 95% in the standard normal distribution. Because the actual and reference objects in the data cube will be clear from the selected context, we now use the following explanation format: ∂y = q occurred because C+, despite C−. In this expression, ∂y = q (q ∈ {low, high}) specifies an event in the data cube, i.e. the occurrence of a quantitative difference between the actual and the reference value of $y$, denoted by $y^a$ and $y^r$, respectively.
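The problem identification step for a two-dimensional context can be sketched as follows. This is a minimal illustration, not the prototype described in section 4.5: the function name `identify_symptoms` and the example table are hypothetical, and σ is estimated here as the sample standard deviation of all residuals in the context, which is one possible reading of "the standard deviation in the same context".

```python
from statistics import mean, stdev

def identify_symptoms(table, delta=1.645):
    """Label each cell of a two-way table as 'high', 'low' or 'normal'."""
    n_rows, n_cols = len(table), len(table[0])
    row_mean = [mean(row) for row in table]                      # average over d2
    col_mean = [mean(table[i][j] for i in range(n_rows))         # average over d1
                for j in range(n_cols)]
    grand_mean = mean(v for row in table for v in row)

    # Expected values under the additive model of eq. (6).
    expected = [[row_mean[i] + col_mean[j] - grand_mean for j in range(n_cols)]
                for i in range(n_rows)]
    residual = [[table[i][j] - expected[i][j] for j in range(n_cols)]
                for i in range(n_rows)]

    # Assumption: sigma is the sample standard deviation of the residuals.
    sigma = stdev(v for row in residual for v in row)

    labels = []
    for row in residual:
        labelled = []
        for r in row:
            s = r / sigma if sigma > 0 else 0.0   # normalized residual, eq. (7)
            if s > delta:
                labelled.append("high")
            elif s < -delta:
                labelled.append("low")
            else:
                labelled.append("normal")
        labels.append(labelled)
    return expected, labels

# Hypothetical table that is additive except for the last cell.
table = [
    [10, 12, 11, 13],
    [20, 22, 21, 23],
    [30, 32, 31, 33],
    [40, 42, 41, 80],
]
expected, labels = identify_symptoms(table)
for row in labels:
    print(row)
```

In this example only the last cell is labelled "high"; the compensating negative residuals in its row and column stay within the threshold.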

4 Explanation and OLAP

4.1 Influence measure

If ∂y = q is identified as an exceptional cell value in some context, we want to explain the difference $\partial y = y^a - y^r$ based on the internal structure of the data cube. An explanation can therefore be given by applying equations from the business model or equations from the associated dimension hierarchies. To determine the contributing and counteracting causes that explain the quantitative difference between the actual and reference value of $y$ in the business model, a measure of influence is defined in the literature [2,3] as follows:

$$\operatorname{inf}(x_i, y) = f(\mathbf{x}^r_{-i}, x^a_i) - y^r, \qquad (8)$$

where $f(\mathbf{x}^r_{-i}, x^a_i)$ denotes the value of $f(\mathbf{x})$ with all variables evaluated at their reference values, except the measure $x_i$. In words, $\operatorname{inf}(x_i, y)$ indicates what the difference between the actual and reference value of $y$ would have been if only $x_i$ had deviated from its reference value [3]. The inf-measure represents a form of ceteris paribus reasoning where the $x_i$’s play the role of causes that produced $y$ [6]. Here it is assumed that $y^a = f(x^a_1, x^a_2, \ldots, x^a_n)$ and $y^r = f(x^r_1, x^r_2, \ldots, x^r_n)$ to enforce consistency. Moreover, when we assume that $y$ is an additive measure, we can simplify the inf-measure for the determination of causes in the dimension hierarchies of the cube. Accordingly, it follows from (3) that:

$$\operatorname{inf}\big(y^{i_1 \ldots i_q \ldots i_n}(\ldots, a_j, \ldots),\; y^{i_1 \ldots i_q - 1 \ldots i_n}(\ldots, a, \ldots)\big) = y^{a;\, i_1 \ldots i_q \ldots i_n}(\ldots, a_j, \ldots) - y^{r;\, i_1 \ldots i_q \ldots i_n}(\ldots, a_j, \ldots). \qquad (9)$$
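As a brief check of how (9) follows from (3) and (8), take $f$ in (8) to be the sum in (3) and assume, in line with the consistency assumption above, that the reference value of the parent cell equals the sum of the reference values of its children. Leaving out the upper indices, only the term for $a_j$ differs between the two sums:

$$\begin{aligned}
\operatorname{inf}\big(y(\ldots, a_j, \ldots),\, y(\ldots, a, \ldots)\big)
&= \Big( y^a(\ldots, a_j, \ldots) + \sum_{k \neq j} y^r(\ldots, a_k, \ldots) \Big) - \sum_{k=1}^{J} y^r(\ldots, a_k, \ldots) \\
&= y^a(\ldots, a_j, \ldots) - y^r(\ldots, a_j, \ldots).
\end{aligned}$$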

The definition of the inf-measure enables us to operationalise the concepts of contributing and counteracting causes. When explanation is supported by a business model equation, the set of contributing (counteracting) causes C+ (C−) consists of the measures $x_i$ of the business model with $\operatorname{inf}(x_i, y) \times \partial y > 0$ ($< 0$). In words, the contributing causes are those variables whose influence values have the same sign as ∂y, and the counteracting causes are those variables whose influence values have the opposite sign. If explanation is supported by the dimension hierarchy equation, the set of contributing (counteracting) causes C+ (C−) consists of the child instances $a_j$ on dimension level $i_q$ out of the hierarchy of a specific dimension with $\operatorname{inf}\big(y^{\ldots i_q \ldots}(\ldots, a_j, \ldots),\, y^{\ldots i_q - 1 \ldots}(\ldots, a, \ldots)\big) \times \partial y > 0$ ($< 0$).
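A minimal sketch of the inf-measure (8) and the resulting split into C+ and C− is given below for the business model relation # non recovered vehicles = # stolen vehicles − # recovered vehicles from section 2. The function names and the actual and reference values are hypothetical and serve only as an illustration.

```python
def influences(f, actual, reference):
    """Return inf(x_i, y) of eq. (8) for every variable x_i of y = f(x)."""
    y_ref = f(**reference)
    result = {}
    for name in reference:
        mixed = dict(reference)
        mixed[name] = actual[name]        # only x_i takes its actual value
        result[name] = f(**mixed) - y_ref
    return result

def split_causes(inf, delta_y):
    """Contributing causes share the sign of delta_y, counteracting ones oppose it."""
    contributing = [k for k, v in inf.items() if v * delta_y > 0]
    counteracting = [k for k, v in inf.items() if v * delta_y < 0]
    return contributing, counteracting

def non_recovered(stolen, recovered):
    # Business model relation from section 2.
    return stolen - recovered

# Hypothetical actual and reference values for one cell.
actual = {"stolen": 1200, "recovered": 500}
reference = {"stolen": 1000, "recovered": 450}

delta_y = non_recovered(**actual) - non_recovered(**reference)
inf = influences(non_recovered, actual, reference)
contributing, counteracting = split_causes(inf, delta_y)

print(delta_y)          # 150
print(inf)              # {'stolen': 200, 'recovered': -50}
print(contributing)     # ['stolen']
print(counteracting)    # ['recovered']
```

Because the relation is additive, the influences sum to the observed difference (200 − 50 = 150); for non-additive relations this need not hold.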

The correct interpretation of the inf-measure depends on the form of the function f; the function has to satisfy the so-called conjunctiveness constraint [2,3]. This constraint captures the intuitive notion that the influence of a single variable should not turn around when it is considered in conjunction with the influence of a number of other variables. An equation satisfies the conjunctiveness constraint if for all subsets X ⊆ {x1, . . . , xn}\{xi}: inf(xi, y) ≥ 0 ⇒ inf(X ∪ {xi}, y) ≥ inf(X, y), and inf(xi, y) ≤ 0 ⇒ inf(X ∪ {xi}, y) ≤ inf(X, y). Two large classes of functions satisfy the conjunctiveness constraint, namely additive and monotonic functions as described in [3]. By monotonicity we mean the monotonicity in all variables separately, on the domain under consideration. Additivity and monotonicity can also be easily checked. Moreover, there is a natural way to construct reference objects for the RHS variables of equations (3) and (4). The basic idea is that the context and statistical model selected as reference object for the LHS determines the reference objects for the variables on the RHS. In addition, if some RHS variable appears in a relation on the LHS, the construction of reference objects can be continued for the RHS variables of this equation following the same principle. In this way, a chain of reference objects is formed in the explanation generation process.
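The conjunctiveness constraint can be checked by brute force for a given pair of actual and reference points, as in the sketch below. Note the limitation: this tests the inequalities only at the supplied points, not for the function on its whole domain. The helper names and all numerical values are hypothetical; the influence of a set X is taken to be the value of f with the variables in X at their actual values and all others at their reference values, minus the reference value of y.

```python
from itertools import combinations

def inf_set(f, actual, reference, names):
    """Influence of a set of variables: those in `names` take actual values."""
    mixed = {k: (actual[k] if k in names else reference[k]) for k in reference}
    return f(**mixed) - f(**reference)

def is_conjunctive(f, actual, reference):
    """Check the conjunctiveness inequalities at one actual/reference pair."""
    variables = list(reference)
    for xi in variables:
        others = [v for v in variables if v != xi]
        inf_i = inf_set(f, actual, reference, {xi})
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                without = inf_set(f, actual, reference, set(subset))
                with_i = inf_set(f, actual, reference, set(subset) | {xi})
                if inf_i >= 0 and with_i < without:
                    return False
                if inf_i <= 0 and with_i > without:
                    return False
    return True

def recovery_ratio(stolen, recovered):
    # Monotonic in each variable separately for positive counts.
    return recovered / stolen

def product(x1, x2):
    # Not monotonic in x1 once x2 is allowed to change sign.
    return x1 * x2

ok = is_conjunctive(recovery_ratio,
                    actual={"stolen": 1200, "recovered": 500},
                    reference={"stolen": 1000, "recovered": 450})
violated = is_conjunctive(product,
                          actual={"x1": 2, "x2": -1},
                          reference={"x1": 1, "x2": 1})
print(ok, violated)   # True False
```

In the second example the influence of x1 alone is positive, yet adding x1 to the set {x2} lowers the joint influence from −2 to −3, which is exactly the reversal the constraint rules out.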

4.2 Reducing the number of explanations

Because every applicable equation yields a possible explanation, the number of explanations generated for a single symptom can be quite large. Especially when explanations are chained together to form a tree of explanations, we might get lost in many branches. In order to leave insignificant influences out of the explanation we introduce three methods. Firstly, in the problem identification phase the analyst distills out a set of symptoms. This means that if a cell does not have a large deviating value (based on some statistical model or defined by a user), it is not identified as a symptom and therefore not considered for explanation generation. Secondly, insignificant influences are left out of the explanation by means of a filter measure. This measure reduces the set of causes to the so-called parsimonious set of causes [2,3]. The parsimonious set of contributing causes $C^+_p$ is the smallest subset of the set of contributing causes C+ such that its influence on $y$ exceeds a particular fraction ($F^+$) of the influence of the complete set; we can write this as $\operatorname{inf}(C^+_p, y) / \operatorname{inf}(C^+, y) \geq F^+$. The fraction $F^+$ is a user-defined number between 0 and 1, and will typically be around 0.85 (corresponding to an explanation of 85% of the observed difference). The definition regarding parsimonious counteracting causes is similar. Often the number of causes will increase when the fraction is chosen closer to 1. Therefore, the value for the fraction has to be determined empirically by the analyst by adapting the value to the internal structure of the OLAP database. By inspecting the generated explanation trees iteratively the analyst makes a selection between significant and insignificant causes. In our experiments with OLAP data sets it was found that fractions with values between 0.7 and 1 are often appropriate. A third way to reduce the number of explanations is by applying a measure of specificity for each applicable equation. This measure quantifies the “interestingness” of the explanation step. The measure is defined as:

$$\text{specificity} = \frac{\#\,\text{possible causes}}{\#\,\text{actual causes}}. \qquad (10)$$

The number of possible causes is the number of RHS elements of each equation, and the number of actual causes is the number of elements in the parsimonious set of causes. In general, we prefer explanation steps with a relatively high specificity value. Using this measure we can order the explanation paths from specific to general and if desired only list the most specific steps.
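The filter measure and the specificity measure (10) can be sketched for an additive equation as follows, using the influence values listed in Table 1 of the case study (section 5) and the fraction 0.80 used there. The function names are hypothetical. Two assumptions are made: the influence of a set of causes is the sum of the individual influences, which holds for additive relations, and the specificity is computed here with only the parsimonious contributing causes counted as actual causes.

```python
def parsimonious_set(influences, fraction):
    """Smallest subset of same-signed causes reaching `fraction` of the total.

    Assumes an additive relation, so that the influence of a set is the sum
    of the individual influences; taking causes in order of decreasing
    magnitude then yields the smallest qualifying subset.
    """
    total = sum(influences.values())
    ranked = sorted(influences.items(), key=lambda kv: abs(kv[1]), reverse=True)
    chosen, running = [], 0
    for name, value in ranked:
        if total != 0 and running / total >= fraction:
            break
        chosen.append(name)
        running += value
    return chosen

def specificity(n_possible, n_actual):
    """Eq. (10): number of possible causes over number of actual causes."""
    return n_possible / n_actual

# Influence values from Table 1, split by sign relative to the "high" symptom.
contributing = {"30??": 1689, "31??": 236, "32??": 35, "33??": 125, "34??": 7}
counteracting = {"35??": -13, "36??": -21, "37??": -20, "38??": -49, "39??": -38}

c_plus_p = parsimonious_set(contributing, 0.80)
c_minus_p = parsimonious_set(counteracting, 0.80)
# Assumption: only the parsimonious contributing causes count as actual causes.
spec = specificity(len(contributing) + len(counteracting), len(c_plus_p))

print(c_plus_p)    # ['30??']
print(c_minus_p)   # ['38??', '39??', '36??', '37??']
print(spec)        # 10.0
```

With these values the single region 30?? already accounts for 1689 of the 2092 contributing units (about 81%), so it alone forms the parsimonious contributing set at a fraction of 0.80.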

4.3 Multi-level explanation

The explanation generation process for multidimensional data is quite similar to the knowledge mining process at multiple dimension levels. In particular, the idea of progressive deepening [5] is very “natural” in the explanation generation process: start symptom detection on an aggregated level in the data cube and progressively deepen it to find the causes for that symptom at lower levels of the dimension hierarchy or business model. We adopt this idea for so-called multi-level explanations. In the previous parts, we have discussed “one-level” explanations, i.e. explanations based on a single relation from the business model or dimension hierarchy. For diagnostic purposes, however, it is meaningful to continue an explanation of ∂y = q by explaining the quantitative differences between the actual and norm values of its contributing causes. In multi-level explanation this process is continued until a parsimonious contributing cause is encountered that cannot be explained further because the business model equations do not contain an equation in which the contributing cause appears on the LHS, or the dimension hierarchies do not contain a drill-down equation in which the contributing cause appears on the LHS.

The result of this process is an explanation tree of causes, where $y$ is the root of the tree with two types of children, corresponding to its parsimonious contributing and counteracting causes respectively. A node that corresponds to a parsimonious contributing cause is a new symptom that can be explained further, and a node that corresponds to a parsimonious counteracting cause has no successors. In the explanation tree there are numerous explanation paths from the root to the leaf nodes. This implies that many different explanations can be generated for a symptom. In most practical cases one would therefore apply the pruning methods discussed above, yielding a comprehensive tree of the most important causes, by judgement of the analyst. In addition, because multi-level explanation usually starts on an aggregated level in the data cube, sparsity is not an important issue. Naturally, only cells that are filled with data are candidates for multi-level explanation.
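The construction of such an explanation tree along a single dimension hierarchy can be sketched recursively as below. The hierarchy (a year split into half-years and quarters), all actual and reference values, and the function names are hypothetical; influences are computed with the additive simplification (9), and the parsimonious sets are selected in order of decreasing magnitude, which assumes additivity.

```python
def parsimonious(influences, fraction):
    """Smallest subset (by decreasing magnitude) reaching `fraction` of the total."""
    total = sum(influences.values())
    ranked = sorted(influences.items(), key=lambda kv: abs(kv[1]), reverse=True)
    chosen, running = [], 0
    for name, value in ranked:
        if total != 0 and running / total >= fraction:
            break
        chosen.append(name)
        running += value
    return chosen

def explain(node, children, actual, reference, fraction=0.8):
    """Recursively build an explanation tree for the symptom at `node`."""
    delta = actual[node] - reference[node]
    tree = {"node": node, "delta": delta, "contributing": [], "counteracting": []}
    kids = children.get(node, [])
    if not kids or delta == 0:
        return tree                       # no drill-down equation: leaf node
    inf = {k: actual[k] - reference[k] for k in kids}          # eq. (9)
    plus = {k: v for k, v in inf.items() if v * delta > 0}
    minus = {k: v for k, v in inf.items() if v * delta < 0}
    # Parsimonious contributing causes are new symptoms and are explained further.
    for k in parsimonious(plus, fraction):
        tree["contributing"].append(explain(k, children, actual, reference, fraction))
    # Parsimonious counteracting causes have no successors.
    tree["counteracting"] = parsimonious(minus, fraction)
    return tree

# Hypothetical hierarchy and values (children sum to their parent).
children = {"2004": ["H1", "H2"], "H1": ["Q1", "Q2"], "H2": ["Q3", "Q4"]}
reference = {"2004": 400, "H1": 200, "H2": 200,
             "Q1": 100, "Q2": 100, "Q3": 100, "Q4": 100}
actual = {"2004": 480, "H1": 285, "H2": 195,
          "Q1": 180, "Q2": 105, "Q3": 95, "Q4": 100}

tree = explain("2004", children, actual, reference)
print(tree)
```

Here the path 2004 → H1 → Q1 is the only explanation path through contributing causes, and H2 is reported as a counteracting cause at the root without being expanded.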

4.4 Hidden causes and cancelling-out effects

The phenomenon that the effects of two or more lower-level variables in the dimension hierarchy or business model cancel each other out, so that their joint influence on a higher-level variable in the business model is partly or fully neutralized, is quite common in multidimensional data sets. For example, the positive financial results of the first half-year could partially cancel out the negative financial results of the next half-year for some business unit or company. If the analyst starts diagnosis with standard multi-level explanation on the aggregated year level, these effects are not identified. For the top-down explanation generation process described in section 4.3 this means that in some data sets possibly significant causes for a symptom will not be detected when cancelling-out effects are present. These causes, which are not detected by multi-level explanation, are called hidden causes. In theory, cancelling-out effects may occur at every level in the dimension hierarchy. Of course, analysts would like to be informed about significant hidden causes, and would consider an explanation tree that does not mention these causes incomplete and inaccurate. Here a multi-step look-ahead method is applied for detecting hidden causes. In short, the look-ahead method is composed of two consecutive phases: an analysis phase (1) and a reporting phase (2). In the analysis phase the explanation generation process starts, as in maximal explanation, with the root equation in the dimension hierarchy by determining parsimonious causes. However, instead of proceeding with strictly parsimonious causes, all non-parsimonious contributing and counteracting causes are investigated for possible cancelling-out effects at a specific (lower) level in the hierarchy. In multi-step look-ahead, a successor of variable $y^{i_1 \ldots i_q \ldots i_n}(\ldots, a_j, \ldots)$ is a hidden cause if its influence on $y^{i_1 \ldots i_q - 1 \ldots i_n}(\ldots, a, \ldots)$ is significant after substitution, when the influence of variable $y^{i_1 \ldots i_q \ldots i_n}(\ldots, a_j, \ldots)$ of eq. (3) on the root variable $y^{i_1 \ldots i_q - 1 \ldots i_n}(\ldots, a, \ldots)$ is not significant. These hidden causes are made visible by means of function substitution, where all the lower-level equations at level $D_k^i$ in the dimension hierarchy are substituted into the higher-level equation under consideration for explanation. In the reporting phase the explanation tree is updated when hidden causes are detected by the multi-step look-ahead method. As in maximal explanation, causes are presented to the analyst in the form of a graphical tree of causes.
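A one-step look-ahead for hidden causes along one dimension hierarchy might be sketched as follows. This is one possible reading of the method, under stated assumptions: "significant" is interpreted as membership of a parsimonious set, substitution is carried out for all children of the root at once, and the relations are additive so that (9) applies. The hierarchy, the values, and the function names are hypothetical.

```python
def parsimonious(influences, fraction):
    """Smallest subset (by decreasing magnitude) reaching `fraction` of the total."""
    total = sum(influences.values())
    ranked = sorted(influences.items(), key=lambda kv: abs(kv[1]), reverse=True)
    chosen, running = [], 0
    for name, value in ranked:
        if total != 0 and running / total >= fraction:
            break
        chosen.append(name)
        running += value
    return chosen

def split(inf, delta):
    plus = {k: v for k, v in inf.items() if v * delta > 0}
    minus = {k: v for k, v in inf.items() if v * delta < 0}
    return plus, minus

def hidden_causes(root, children, actual, reference, fraction=0.8):
    """Return grandchildren of `root` that are significant only after substitution."""
    delta = actual[root] - reference[root]

    # Standard one-level explanation with the root equation.
    inf1 = {c: actual[c] - reference[c] for c in children[root]}
    plus1, minus1 = split(inf1, delta)
    visible = set(parsimonious(plus1, fraction)) | set(parsimonious(minus1, fraction))

    # Substitute the lower-level equations into the root equation.
    parent_of = {g: c for c in children[root] for g in children.get(c, [])}
    inf2 = {g: actual[g] - reference[g] for g in parent_of}
    plus2, minus2 = split(inf2, delta)
    significant = parsimonious(plus2, fraction) + parsimonious(minus2, fraction)

    # Hidden: significant after substitution, but the parent was not significant.
    return [g for g in significant if parent_of[g] not in visible]

# Hypothetical data: within H1 a rise in Q1 is largely cancelled by a drop in Q2.
children = {"2004": ["H1", "H2"], "H1": ["Q1", "Q2"], "H2": ["Q3", "Q4"]}
reference = {"2004": 400, "H1": 200, "H2": 200,
             "Q1": 100, "Q2": 100, "Q3": 100, "Q4": 100}
actual = {"2004": 465, "H1": 205, "H2": 260,
          "Q1": 190, "Q2": 15, "Q3": 160, "Q4": 100}

print(hidden_causes("2004", children, actual, reference))   # ['Q1', 'Q2']
```

In this example H1 deviates by only 5 from its reference and is filtered out by the standard procedure, while the look-ahead reveals that Q1 (+90) and Q2 (−85) nearly cancel each other inside it.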

4.5 Software implementation

In this section we briefly present the most important concepts of the prototype software implementation of the explanation formalism in MS Excel/Access in combination with Visual Basic. This application was initially programmed to perform the experiments and analyses necessary for the case study. The back-end of the program is based on Microsoft’s Access database, where a statistical model (e.g. multi-way ANOVA) is fitted on the multidimensional source data (the data is assumed to be laid out in a star schema) and reference values are computed for each data point. Moreover, the Excel front-end (which connects with the database) is composed of a module for symptom identification and one for explanation generation. In symptom identification the standard deviation is computed for each cell to normalize the residuals. Subsequently, exceptional cells are determined in a certain context of the data cube. The diagnostic module contains the method for multi-level explanation as well as the look-ahead method. For the implementation of the procedure we applied tree programming to generate trees of significant causes for symptoms. In the tree-viewer the whole explanatory graph can be made visible by manipulating the tree. In addition, the tree of causes is projected on the explanatory graph by highlighting parsimonious causes with a colour. By clicking on the cause under consideration the details for the cause become visible on the screen.

5 Case study: analysis of vehicle criminality data

The methodology developed in this paper is applied to a practical case study. The research for the case study is carried out as part of the project PROTECT [13]. This project aims to contribute to the knowledge and insight that improve the performance of global supply chains in terms of their reliability. The case study is performed on multidimensional vehicle criminality data obtained from the Dutch Foundation for Tackling Vehicle Crime [4]. The goal of the foundation is the reduction of vehicle crime (e.g. theft and fraud) in the Netherlands by means of prevention and by supporting public partners (e.g. police and insurance companies) in investigations. An important way to support the tackling of vehicle crimes is to perform analyses on vehicle criminality data. Currently, data analyses, mostly in the form of summary reports, statistics and trends, are used for the detection and prevention of vehicle crimes. When, for example, it is known how many cars and trucks are stolen in certain locations, police can patrol there more often. The vehicle theft data set consists of records describing vehicle thefts in the period 1995-2004. In the Netherlands approximately 30,000 vehicles are stolen every year. The normalized data set consists of 295,291 records in the fact table and five dimension tables describing hierarchies organized in a star schema shown in Fig. 2. In the dimension tables the numbers within brackets denote the cardinality of that level in the dimension hierarchy.

Fig. 2. Star model with five dimension tables and a central fact table.

We applied diagnostic problem solving on this data set to detect and explain exceptional values such as regions with a relatively high number of vehicle thefts. Here we present an example of multi-level explanation of a symptom in the dimension Location vehicle stolen of the data cube under consideration. Symptom detection in a data cube usually starts at an aggregated level, where the analyst has to select a specific context (the combination of aggregation levels from the domains) from which to start the explanation generation process. Suppose that an analyst starts exploring the cube in the context (Year,Postal code pos 1) for the measure # stolen vehicles. A postal code in The Netherlands is composed of four digits and two alphabetic letters; for example, 1234 AB is a postal code. The first digit represents the most global location and the last letter the most local location. A “?” indicates an aggregate at that position of the postal code. Under the assumptions of the additive model we pre-computed the expected values for this context using (6), $\hat{y}(\text{Year}, \text{Postal code pos 1}) = \bar{y}(\text{Year}, +) + \bar{y}(+, \text{Postal code pos 1}) - \bar{y}(+,+)$, and standardized the residuals. We write $y$ for the measure # stolen vehicles. Here δ = 1.645 is chosen as the threshold value, corresponding to a probability of 95% in the normal distribution. The coloured cells in Fig. 3 represent the identified symptoms for this context. The program, for example, singles out the cell (2004,3???), because the standardized residual for the first position of the postal code “3???” in the year 2004 (= 2.7726) is larger than the threshold. Therefore, problem identification labels this cell as symptom $S = \partial y^{12}$(2004,3???) = “high”. A full specification of the event to be explained is:


Fig. 3. Computed exceptional values (in %) and identified symptoms in context (Year,Postal code pos 1).

Table 1. Data for $S = \{\partial y^{12}(2004,\text{3???}) = \text{“high”}\}$

cell            actual   reference   inf(y^{13}, y^{12})
y(2004,3???)      7362        5411
y(2004,30??)      3154        1465       1689
y(2004,31??)       585         349        236
y(2004,32??)       254         219         35
y(2004,33??)       799         674        125
y(2004,34??)       418         411          7
y(2004,35??)       937         950        -13
y(2004,36??)       187         208        -21
y(2004,37??)       396         416        -20
y(2004,38??)       376         425        -49
y(2004,39??)       256         294        -38

⟨(2004,3???), # stolen vehicles = “high”, norm(2004,3???)⟩. So we will address the following question:

“Why is the number of stolen vehicles in cell (2004,3???) relatively high compared with the expected value for this cell in the context under consideration?”
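The symptom-detection step described above (additive expected values, standardized residuals, threshold δ = 1.645) can be sketched as follows. This is a minimal illustration, not the prototype's code. It assumes that the "+" entries are averages over the corresponding dimension, that residuals are standardized by their overall sample standard deviation, and that cells below −δ are labelled "low". The function names and the small toy cube are hypothetical and do not come from the vehicle theft data set.

```python
import numpy as np

def additive_expected(y):
    """Expected cell values under the additive two-way model (no interaction)."""
    row_mean = y.mean(axis=1, keepdims=True)   # plays the role of y(Year,+)
    col_mean = y.mean(axis=0, keepdims=True)   # plays the role of y(+,Postal code pos 1)
    grand_mean = y.mean()                      # plays the role of y(+,+)
    return row_mean + col_mean - grand_mean

def detect_symptoms(y, delta=1.645):
    """Label each cell "high", "low" or "normal" from its standardized residual."""
    residuals = y - additive_expected(y)
    # Assumption: one common scale estimate for all residuals in the context.
    z = residuals / residuals.std(ddof=1)
    labels = np.full(y.shape, "normal", dtype=object)
    labels[z > delta] = "high"
    labels[z < -delta] = "low"
    return z, labels

# Toy context (rows: years, columns: postal code regions); illustrative numbers only.
toy_cube = np.array([[10.0, 20.0, 30.0],
                     [12.0, 22.0, 32.0],
                     [11.0, 21.0, 60.0]])
z_scores, symptom_labels = detect_symptoms(toy_cube)
print(symptom_labels)
```

In the toy cube only the last cell deviates strongly from the additive pattern, so it is the only cell labelled "high".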

The method in the diagnostic program is configured for one-step look-ahead. In addition, we omit insignificant influences from the explanations to protect the human analyst from information overload. After experimentation, $F^{+} = F^{-} = 0.80$ was determined as an appropriate fraction for the data set. Explanation generation may start in the “Location vehicle stolen” dimension on the level “Postal code position 1” for the detected symptom, where explanation is supported by additive relations from the associated dimension hierarchy: Postal code position 1 → Postal code position 2 → Postal code position 3 → . . .. The increase in the number of stolen vehicles in the first-digit postal code region “3???” in the Netherlands is examined on the second-digit level of the postal code. Hence the first equation, obtained from (3), that is applied for explanation generation is: $y^{12}(2004,\text{Postal code pos 1}) = \sum_{j=1}^{10} y^{13}(2004,\text{Postal code pos 2}_j)$.

The reference objects in Table 1 follow from the chain of reference objects, and are determined in the context (Year,Postal code pos 1) with the two-way ANOVA model: $y(\text{Year},\text{Postal code pos 2}_j) = y(\text{Year},+) + y(+,\text{Postal code pos 2}_j) - y(+,+)$. Therefore, $y^{12}(2004,\text{3???})$ is the root of the explanation tree. The norm values for explanation generation are based on the expected values for the entries of the dimension level Postal code pos 2 in the context (Year,Postal code pos 2). Computation of the influences of the individual variables for the additive equation above with (9) yields the results in Table 1. From the data in this table it can be concluded that $C_p^{+} = \{y(2004,\text{30??}), y(2004,\text{31??})\}$, since these two relatively large causes explain the desired fraction of $\mathrm{inf}(C^{+}, y(2004,\text{3???}))$. The set of parsimonious counteracting causes is given by $C_p^{-} = \{y(2004,\text{36??}), y(2004,\text{37??}), y(2004,\text{38??}), y(2004,\text{39??})\}$. The parsimonious contributing causes are explained further on the levels Postal code pos 3, Postal code pos 4, etc. These one-level explanations are combined into a complete diagnosis for the dimension “Location vehicle stolen”. Figure 4 summarizes the results of the multi-level diagnosis in the form of an explanation tree, where the lines indicate parsimonious contributing causes (dotted lines indicate counteracting causes), the numbers on the lines indicate the relative values of the influence measures, and the ratios indicate the specificity value of the explanation step. The specificity values are determined using (10). For example, the explanation step on the Postal code pos 2 level is very specific for postal codes “30??” and “31??”, because only 2 of the 10 possible causes are required here to explain the desired fraction. In summary, the explanation tree depicted in Fig. 4 shows the analyst at a glance the set of regions, districts, and streets that are identified as the largest causes in the dimension “Location vehicle stolen” for the symptom under consideration. Moreover, similar explanation trees can be constructed automatically by the analyst for the hierarchies in the other dimensions. A comparison of the model results with human analyses showed a large correspondence. In this way the analyst is assisted in processing and analysing large amounts of OLAP data.

Fig. 4. Diagnosis for S = {∂# stolen vehicles(2004,3???) = “high”} for dimension Location vehicle stolen.
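The influence values reported in Table 1 can be reproduced with a short sketch. It assumes that, for an additive relation, the influence of a child cell reduces to its actual value minus its reference value (the ceteris paribus difference); the paper's measure (9) is not restated here, so this reading is an assumption, although it matches every row of Table 1. The variable and function names are hypothetical.

```python
# (actual, reference) pairs copied from Table 1 for the children of cell (2004,3???).
TABLE_1 = {
    "30??": (3154, 1465),
    "31??": (585, 349),
    "32??": (254, 219),
    "33??": (799, 674),
    "34??": (418, 411),
    "35??": (937, 950),
    "36??": (187, 208),
    "37??": (396, 416),
    "38??": (376, 425),
    "39??": (256, 294),
}
PARENT_ACTUAL, PARENT_REFERENCE = 7362, 5411  # row y(2004,3???) of Table 1

def additive_influences(children):
    """Influence of each child on the parent, assuming an additive relation:
    replacing one reference value by its actual value shifts the parent by
    (actual - reference)."""
    return {cell: actual - reference
            for cell, (actual, reference) in children.items()}

influences = additive_influences(TABLE_1)

# Split into contributing (positive) and counteracting (negative) influences.
contributing = {cell: v for cell, v in influences.items() if v > 0}
counteracting = {cell: v for cell, v in influences.items() if v < 0}

# Rank all children by the magnitude of their influence.
ranked = sorted(influences, key=lambda cell: abs(influences[cell]), reverse=True)

for cell in ranked:
    print(f"y(2004,{cell}): {influences[cell]:+d}")
```

Because the children add up to the parent on both the actual and the reference side, the influences sum to the parent's deviation, 7362 − 5411 = 1951, with “30??” and “31??” ranked first.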

6 Conclusions

In this paper, we extended the method for automated diagnosis as described in [2,3] for the explanation of exceptional values in OLAP business databases and developed a new prototype implementation. Exceptional cell values are determined based on a statistical model appropriate for multidimensional data. Explanation generation is supported by the two internal structures of the OLAP data cube: the business model and the dimension hierarchies. Therefore, we developed a multi-level explanation method for finding significant causes in these structures, based on an influence measure which embodies a form of ceteris paribus reasoning. This method is further enhanced with a look-ahead functionality to detect hidden causes. The result of the explanation generation process is a semantic tree, in which the main causes for a symptom are presented to the analyst. Furthermore, to prevent information overload for the analyst, several techniques are proposed to prune the explanation tree. The methodology is demonstrated by a case study describing the analysis of an OLAP data set with vehicle criminality figures. The case study shows, with our prototype software, that our method is capable of assisting analysts in generating explanations for exceptional values in multidimensional data. The results suggest that our method (semi-)automates the current manual discovery process of problem diagnosis in OLAP databases.

References

[1] Caron, E., and Daniels, H.A.M., Automated business diagnosis in the OLAP context, Operations Research Proceedings 2004, Springer, H.A. Fleuren, D. den Hertog and P.M. Kort (Eds.), (2005), 425–433.

[2] Feelders, A.J., Diagnostic reasoning and explanation in financial models of the firm, Tilburg University, Department of Economics, (1993).

[3] Feelders, A.J., and Daniels, H.A.M., A general model for automated business diagnosis, European Journal of Operational Research, 130, (2001), 623–637.

[4] Foundation for Tackling Vehicle Crime, http://www.stavc.nl, (2006).

[5] Han, J., and Fu, Y., Discovery of Multiple-Level Association Rules from Large Databases, VLDB’95, (1995), 420–431.

[6] Heckman, J., Causal Parameters And Policy Analysis In Economics: A Twentieth Century Retrospective, The Quarterly Journal of Economics, 115, 1, (2000), 45–97.

[7] Hesslow, G., Explaining differences and weighting causes, Theoria, 49, (1983), 87-111.

[8] Hoaglin, D.C., Mosteller, F., and Tukey, J.W., Exploring data tables, trends and shapes, Wiley series in probability, (1988), New York.

[9] Humphreys, P.W., The chances of explanation, Princeton University Press, Princeton, New Jersey, (1989).

[10] Lenz, H.J., and Shoshani, A., Summarizability in OLAP and Statistical Data Bases, Statistical and Scientific Database Management, (1997), 132–143.


[11] Pedersen, T.B., Jensen, C.S., and Dyreson, C.E., A foundation for capturing and querying complex multidimensional data, Inf. Syst., 26, 5, (2001), 383–423.

[12] Pounds, W.F., The process of problem finding, Industrial Management Review, 11, 1, (1969), 1–19.

[13] PROTECT, protecting people, planet and profit, http://protect.transumo.nl, (2006).

[14] Sarawagi, S., Agrawal, R., and Megiddo, N., Discovery-Driven Exploration of OLAP Data Cubes, Conf. Proc. EDBT’98, (1998), London, UK, 168–182.

[15] Sarawagi, S., iDiff: Informative Summarization of Differences in Multidimensional Aggregates, Data Min. Knowl. Discov., vol. 5, (2001), 255–276.

[16] Scheffe, H., The analysis of variance, Wiley, (1959), New York.

[17] Thalhammer, T., Schrefl, M., and Mohania, M., Active data warehouses: complementing OLAP with analysis rules, Data & Knowledge Engineering, 39, (2001), 241–269.

[18] Verkooijen, W.J., Automated financial diagnosis: a comparison with other diagnostic domains, J. Inf. Sci., 19, 2, May, (1993), 125–135.
