
A new data warehouse approach using graph

Manel Zekri, Imen Marsit, Abdelaziz Adellatif

University of Tunis El Manar, Faculty of Sciences of Tunis, Department of Computer Science
[email protected], [email protected], [email protected]

Abstract. The standard data-source-based approach to data warehouse design has often been criticized for ignoring the users' needs. However, this inefficiency is due to the fact that the works following this trend consider the conceptual model or a physical data source as the sole and best representation of the data sources. The queries addressed to the operational production databases, which express actual user needs, have hence always been ignored. In this article we take the history of these queries into account, in addition to the conceptual data model (CDM), in order to introduce a new approach for the automatic design of a multidimensional schema. The CDM is first translated into a multidimensional schema, which is then refined by using the history of queries. Both design steps use graphs as the formalism to represent the decision-making data model.

Keywords: Data warehouse, conceptual data model, graph, production database, multidimensional schema, data sources, history of queries

I. INTRODUCTION

Building a data warehouse (DWH) consists in making available to managers all or part of the data accumulated over many years in the files and databases of transactional systems, data which cannot be efficiently exploited in the decision-making process. Thus, before moving to a DWH, a given company has usually been operational for a more or less long period, and its databases have been in production during that time. On the other hand, most database management systems (DBMS) include a query logging function (traceability), which can be activated per user, per query type, etc. We propose here to exploit this history in the DWH design process. Indeed, users have expressed information needs which were satisfied by these queries, and the shift from a production database to a DWH must ensure that such needs will still be met. To the best of our knowledge, no DWH design approach has been based on graphs, and none has benefited from this history, although it can inform us about the different analyses performed by the users on the source databases. Our goal is to propose a new automatic approach based on the data sources, represented by a conceptual data model (CDM), and on the history of queries. More precisely, our approach operates in two steps:

1) The derivation of a first multidimensional schema from a CDM;

2) The refinement of this first schema by exploiting the query history.

The remainder of the paper is organized as follows. Section 2 presents a brief state of the art. Then, we introduce our strategy for generating a first multidimensional schema from a CDM in Section 3. Section 4 is devoted to the exploitation of the queries. In Section 5 we present an evaluation of our approach by comparing its results against those obtained with PowerAMC. Finally, in Section 6, we present a conclusion and some issues for future work.

II. STATE OF THE ART ON DESIGN APPROACHES

Designing a DWH is now recognized as a crucial task for the success of a data warehousing project [5]. Several approaches have been proposed, and those based on data sources can be classified into two categories: (i) data source-guided approaches; and (ii) mixed approaches.

Approaches falling within the first category rely on the data sources for the derivation of multidimensional schemas. They are based on the enterprise data model and exploit the relations between data in order to develop a multidimensional schema in a structured way. This type of approach has been adopted by [2], [4], [6] and [9]. However, these works cannot be automated: although they derive a multidimensional schema based on the identification of facts, they do not specify a formal criterion for identifying the facts from the data model and rely on their manual identification. Moreover, such approaches do not incorporate user needs. They ensure information availability, but availability alone does not guarantee that the user will be satisfied by the DWH.

Mixed approaches consider both the data sources and the user needs in order to ensure that the user finds the information of interest and that this information is available. Among the works adopting this type of approach we can cite [1], [3], [7], [10] and [11]. In practice, [1] and [10] derive a constellation from the data sources and validate it against the user needs, whereas [3], [7] and [11] derive a set of stars from the user needs and validate them against the data source schema. For the identification of facts, the first approaches consider as a fact any "entity or n-ary association with at least one numerical attribute". [11] considers that this assumption generates a large number of schemas, the majority of which do not correspond to facts: it applies to entities and associations whose numerical attributes are keys or do not correspond to measures, such as "postal code" or "phone number". To tackle this deficiency, they propose a new hypothesis: "entity or association with non-key numerical attributes". [3] and [7] are worth mentioning since they consider that facts are defined by the user needs.

In this article, we exploit, in addition to the conceptual model of the data sources, the whole history of queries, in order to propose an automatic method for designing a DWH.

The following section details our strategy for exploiting the CDM.

III. EXPLOITING THE CDM

Ensuring information availability was one of our original objectives. For this purpose, we rely on the data sources as represented by the conceptual data model (CDM) of the production database, except that this CDM may comprise associations that have no corresponding data in the sources. This is the case of associations of the types "(x, n), (x, 1)" and "(x, 1), (x, 1)", where x belongs to {0, 1}, which do not correspond to any table in the operational database. To overcome this shortcoming, we propose to go through an intermediate representation of the data sources as a directed graph, while ensuring (i) the differentiation between associations and entities; (ii) the elimination of associations that have no corresponding data in the sources; and (iii) the use of the cardinalities. This graph will also be useful during the following step to exploit the history of queries. Thus, this step can be divided into two stages:

- The translation of the CDM into a graph;

- The use of this graph for the generation of a set of stars.

A. The Translation of a CDM into a Graph

In the following, we present the rules for translating a CDM into a graph. Entities and associations become nodes, and links become arcs. To distinguish between nodes representing entities and those representing associations, entity nodes are drawn as circles and association nodes as rectangles. Nodes carry the same attributes as the corresponding entity or association. For the links, we propose the following rules:

- N-ary associations: arcs going out from the node representing the association towards the nodes representing the linked entities.

- Binary associations:

  - (x, n), (y, n), where x, y belong to {0, 1}: two arcs leaving the node representing the association;

  - (x, n), (y, 1), where x, y belong to {0, 1}: an arc entering the association node from the entity carrying (y, 1) and an arc leaving the association node towards the entity carrying (x, n);

  - (x, 1), (y, 1), where x, y belong to {0, 1}: one arc entering and one arc leaving the association node; the order is irrelevant here.

Arcs are weighted 0 if the minimum cardinality of the link is 0; links whose minimum cardinality is 1 are represented by arcs weighted 1. A final step consists in eliminating any node that does not correspond to data in the sources. To do so, we consider node degrees, one of the properties attached to the nodes of the graph. For a node i we have:

- d+(i), the out-degree of node i: the number of arcs having vertex i as their initial end, i.e., the number of arcs going out of i;

- d-(i), the in-degree of node i: the number of arcs having vertex i as their terminal end, i.e., the number of arcs coming into i;

- d(i) = d+(i) + d-(i), the degree of node i.

We assume that a rectangular node for which d+(i) = d-(i) does not correspond to data in the sources and must therefore be eliminated. Such a node is removed together with its incoming arc, and its outgoing arc is extended so as to connect the predecessor of the node to its successor, as shown in Figure 1. Indeed, following the logic described above, these nodes correspond to associations of type "(x, n), (x, 1)" and "(x, 1), (x, 1)" where x ϵ {0, 1}. Thus, we define the following rule:

R1: Any rectangular node i such that d+(i) = d-(i) is eliminated from the graph, with:

predecessor(successor(i)) = predecessor(i)
successor(predecessor(i)) = successor(i)

Fig. 1: Example of graph refinement
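To make the translation rules and rule R1 concrete, the following minimal Python sketch (not the authors' prototype) builds the graph from a toy CDM description and eliminates the associations that have no corresponding data. The input format, the exact arc directions chosen for the "(x, n), (y, 1)" case and all identifiers are assumptions made for illustration only.

def translate_cdm(entities, associations):
    """entities: set of entity names.
    associations: {name: [(entity, (min_card, max_card)), ...]}.
    Returns (shapes, arcs): shapes maps each node to 'circle' or 'rect';
    arcs is a set of (source, target, weight) with weight = min cardinality."""
    shapes = {e: "circle" for e in entities}
    arcs = set()
    for assoc, legs in associations.items():
        shapes[assoc] = "rect"
        for entity, (mini, maxi) in legs:
            if maxi == "n":
                arcs.add((assoc, entity, mini))   # many side: arc leaves the association
            else:
                arcs.add((entity, assoc, mini))   # (x, 1) side: arc enters the association
    return shapes, arcs

def apply_r1(shapes, arcs):
    """R1: eliminate every rectangular node i with d+(i) = d-(i) and
    reconnect its predecessors to its successors (extending the outgoing arc)."""
    for node in [n for n, s in shapes.items() if s == "rect"]:
        preds = [(src, w) for (src, tgt, w) in arcs if tgt == node]
        succs = [(tgt, w) for (src, tgt, w) in arcs if src == node]
        if preds and len(preds) == len(succs):
            arcs = {a for a in arcs if node not in (a[0], a[1])}
            del shapes[node]
            for p, _ in preds:
                for s, w in succs:
                    arcs.add((p, s, w))
    return shapes, arcs

# Toy example: a Course belongs to exactly one Section, a Section has 1..n Courses.
shapes, arcs = translate_cdm(
    {"Course", "Section"},
    {"belongs": [("Course", (1, 1)), ("Section", (1, "n"))]})
shapes, arcs = apply_r1(shapes, arcs)
print(shapes)   # both remaining nodes are circles; 'belongs' was eliminated
print(arcs)     # {('Course', 'Section', 1)}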

In the following, we present an illustrative example of these rules on a sample CDM (see Figure 2). For reasons of clarity, the attributes of the nodes are not shown.

Fig. 2: Example of application to an extract from a CDM

B. Generating a first multidimensional schema

The graph obtained from the previous step will be denoted by GCDM. The authors in [11] only considered the "n-ary associations carrying non-key numerical data" as facts. However, this criterion is restrictive: there may exist facts whose attributes are non-numerical, such as quality measures. For example, let us consider the association "to be_present", with the non-numerical attribute "present: boolean". This association can act as a fact to analyse course attendance; indeed, the presence ratio (%) can show us whether or not students are interested in a course. For this reason, we assume that each association corresponding to data in the sources may be subject to analysis and is therefore a potential fact. These associations are represented in the GCDM by rectangular nodes. Indeed, any subgraph consisting of one rectangular node and the circular nodes directly or transitively connected to it plays the role of a star. For measures, we distinguish two cases. If the node has a numerical attribute, then this attribute is regarded as a measure. For nodes without attributes, or with only non-numerical attributes, the number of instances is taken as the measure, which is consistent with the approach presented in [4]. The latter considers that a fact may have no measures when the only interesting thing is to record the number of occurrences of an event. This fact type is called an "empty fact" [4]: it has no measures, and what is counted is the number of instances.
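A possible sketch of this star-extraction step is given below, using the same assumed (shapes, arcs) representation as in the previous sketch plus an attributes map; the attribute flags and all names are hypothetical.

def reachable_circles(fact, shapes, arcs):
    """Circular nodes directly or transitively connected to the given fact node."""
    seen, frontier = set(), [fact]
    while frontier:
        node = frontier.pop()
        for src, tgt, _ in arcs:
            if src == node and tgt not in seen and shapes.get(tgt) == "circle":
                seen.add(tgt)
                frontier.append(tgt)
    return seen

def extract_stars(shapes, arcs, attributes):
    """One star per rectangular node. Measures are its non-key numerical
    attributes; an 'empty fact' falls back to counting its instances."""
    stars = []
    for fact, shape in shapes.items():
        if shape != "rect":
            continue
        measures = [name for (name, numeric, key) in attributes.get(fact, [])
                    if numeric and not key]
        stars.append({"fact": fact,
                      "dimensions": reachable_circles(fact, shapes, arcs),
                      "measures": measures or ["count(*)"]})
    return stars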

At the hierarchy level, a problem that occurs is the possibility of having cycles. For example, the graph depicted in Figure 2 is a star; however, it contains cycles: "Course" and "Class" have two common hierarchical levels, "Section" and "Cycle". To handle this problem, we rely on the work of [7], where this type of problem has been studied under the name of "drill-down completeness". We retain the link with non-zero weight; if the weights are equal, it suffices to keep one of them and delete the others. This has no effect on future queries. Figure 3 shows an example of hierarchy refinement.
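The rule above can be sketched as a small pruning pass over the arcs of one star; the convergence test and the tie-breaking below are our own reading of the rule, not the authors' exact algorithm.

def refine_hierarchies(star_arcs):
    """When several arcs of a star converge on the same hierarchy level,
    keep the arc with non-zero weight (or any single one when weights are equal)."""
    by_target = {}
    for arc in star_arcs:                                 # arc = (source, target, weight)
        by_target.setdefault(arc[1], []).append(arc)
    kept = set()
    for target, incoming in by_target.items():
        incoming.sort(key=lambda a: a[2], reverse=True)   # prefer weight 1 over weight 0
        kept.add(incoming[0])
    return kept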

Fig. 3: Example of hierarchy refinement

In this example, a non-zero weight means that for each instance of "Cycle" there is a "Class" instance, while a zero weight means that there may be "Cycle" instances without courses. Now let us suppose that, in the future, we receive a query of the kind "total number of hours" and that the corresponding data are as represented in Table 1.

TABLE 1. EXAMPLE OF FUTURE DATA.

Thus, the answer to the previous query can be wrong if the calculation is done using the course value of "25h" whereas the real value is "30h". Having a constellation, we just have to translate each rectangular node into a fact table and the circular nodes into dimension tables. The schema obtained from the star of Figure 3 is illustrated in Figure 4.

Fig. 4: Preliminary fact schema
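As a rough illustration of this star-to-relational translation (dimension attributes, keys and types are deliberately simplified, and every name below is hypothetical), one could emit DDL of the following kind:

def star_to_ddl(star):
    """star = {'fact': str, 'dimensions': iterable, 'measures': list}, as produced above."""
    statements = []
    for dim in sorted(star["dimensions"]):
        statements.append(f"CREATE TABLE dim_{dim} ({dim}_id INTEGER PRIMARY KEY);")
    key_cols = ", ".join(f"{d}_id INTEGER" for d in sorted(star["dimensions"]))
    measure_cols = ", ".join(f"{m} NUMERIC" for m in star["measures"] if m != "count(*)")
    columns = ", ".join(c for c in (key_cols, measure_cols) if c)
    statements.append(f"CREATE TABLE fact_{star['fact']} ({columns});")
    return statements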

IV. EXPLOITING THE HISTORY OF QUERIES

We will extract the entire query history; each query has the following general form:

Select c1, c2, c3, ..., cn, f1, f2, ..., fn
From t1, t2, ..., tn
Where conditions(t1, t2, t3, ..., tn)
Group by c1, c2, c3, ..., cn

where ci is a column of table ti and fi is an aggregate function (sum, min, max, etc.) applied to a set of columns. We only consider interrogation queries; update operations are ignored, since a user need is expressed as a selection query. We first have to ensure that these queries are valid: a query must be syntactically correct, with no error having occurred during its formulation. Moreover, it is possible that the database underwent a denormalization and that a query was written against the modified schema; such a query cannot be exploited in our design process and should be ignored. In order to check the validity of a query, we rely on the GCDM graph obtained from the previous steps, which represents the production database: each table is represented by a node, and the possible joins between tables are represented by arcs.

So, if we represent queries as graphs, using the same formalism as the one used to obtain the GCDM, it suffices to check that the graph of a query is a subgraph of the GCDM in order to ensure that the query is valid. Thus, this step is composed of two substeps: the representation of queries as graphs, and the exploitation of these graphs to refine the first multidimensional schema.
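Under the assumption that both the GCDM and a query are held as a set of nodes plus a set of table-to-table links (a join carries no direction), the validity test reduces to a containment check, as in the following sketch:

def is_valid_query(query_nodes, query_joins, gcdm_nodes, gcdm_arcs):
    """A query is considered valid when its graph is a subgraph of the GCDM:
    every table maps to a node and every join maps to an arc (direction ignored)."""
    undirected = {frozenset((src, tgt)) for (src, tgt, _) in gcdm_arcs}
    return (set(query_nodes) <= set(gcdm_nodes)
            and all(frozenset(j) in undirected for j in query_joins))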

A. Representation of Queries as Graphs

The tables included in the From clause are the nodes, and the joins between these tables are represented by arcs. Tables whose corresponding nodes in the GCDM are rectangular (resp. circular) are represented by rectangular (resp. circular) nodes. In addition, we define an attribute "Measure" containing all the measures that appear in the Select clause and to which an aggregate function is applied. Consider the following query, which calculates the averages of the students and is represented by Figure 5.

Raverage: the average of the students

Select sum(note * coef) / sum(coef), name
From takeEx, possessEx, course, exam, student
Where takeEx.stu_id = student.stu_id
And takeEx.ex_id = exam.ex_id
And possessEx.ex_id = exam.ex_id
And possessEx.mod_id = course.mod_id
Group by name, first name
Having sum(coef) != 0;

Fig. 5: Representation of the query Raverage
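A sketch of the construction of such a query graph is given below. It assumes the SQL text has already been parsed into its table list, join pairs and Select expressions (the parsing itself is out of scope here), and the aggregate detection is deliberately naive; all parameter names are ours.

AGGREGATES = ("sum(", "avg(", "min(", "max(", "count(")

def build_query_graph(tables, join_pairs, select_exprs, gcdm_shapes):
    """Nodes keep the shape they have in the GCDM; joins become (undirected) arcs;
    the 'Measure' attribute collects the aggregated Select expressions."""
    return {
        "nodes": {t: gcdm_shapes.get(t, "circle") for t in tables},
        "arcs": {frozenset(pair) for pair in join_pairs},
        "measure": [e for e in select_exprs
                    if any(agg in e.lower() for agg in AGGREGATES)],
    }

# Hypothetical parse of the Raverage query above:
raverage_graph = build_query_graph(
    tables=["takeEx", "possessEx", "course", "exam", "student"],
    join_pairs=[("takeEx", "student"), ("takeEx", "exam"),
                ("possessEx", "exam"), ("possessEx", "course")],
    select_exprs=["sum(note * coef) / sum(coef)", "name"],
    gcdm_shapes={"takeEx": "rect", "possessEx": "rect"})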

As already mentioned, a query expresses a need. However, the same need may be expressed repeatedly: for example, a query calculating the averages of the students and a second one grouping them by class. Here, the needs are nested, and it is more logical to consider the query expressing the most detailed need. Moreover, the same query can be repeated or expressed in a different way: the same tables with different measures, clauses written in a different order, etc. For that purpose, we propose to gather the queries into groups. To do so, we rely on the following definitions:

Def 1: Similar queries: two queries are similar if they are represented by two similar graphs.

Def 2: Sub-query: a query R1 is considered as a sub-query of a query R2 if the graph of R1 is a subgraph of the graph of R2.

We gather the queries into groups, considering that similar queries and sub-queries belong to the same group. Each group is characterized by a set of measures, a graph and a weight. The set of measures is composed of the "Measure" attribute of each query belonging to the group. The graph is that of the most detailed query of the group. The weight records the number of queries included in the group; it tells us how many times a need has appeared and thus gives an idea of the importance of a subject of analysis. This procedure is explained by the following example:

Example: suppose that a user has addressed the queries R1, R2 and R3 such that:

R1: averages of the students

R2: averages of the students by class and group

R3: averages of the students by group

With the arrival of query R1, we define the query group GR1 as illustrated by Figure 6.

Fig. 6: Group GR1

With the arrival of R2, after having built its graph, we note that the graph of GR1 is a subgraph of the graph representing this query. Thus, R2 belongs to the group GR1. The measures are the same, so nothing changes except the weight value, which is incremented by 1. These modifications are represented by Figure 7.

Fig. 7: Group GR1 after modification

The graph of the third query is a subgraph of the graph of GR1; thus, the graph of GR1 is not modified. The measure is again the same, so nothing changes except the weight value, which is incremented.
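The grouping procedure illustrated by this example could be sketched as follows, reusing the query-graph structure above; the containment test used for "similar" and "sub-query" detection is a simplifying assumption of this sketch.

def is_subgraph(g1, g2):
    """g1 is a subgraph of g2 (identical graphs also satisfy this test)."""
    return set(g1["nodes"]) <= set(g2["nodes"]) and g1["arcs"] <= g2["arcs"]

def add_to_groups(groups, query_graph):
    """Add a new query to an existing group (incrementing its weight, merging its
    measures and keeping the most detailed graph) or open a new group."""
    for group in groups:
        if is_subgraph(query_graph, group["graph"]) or is_subgraph(group["graph"], query_graph):
            if is_subgraph(group["graph"], query_graph):   # the new need is more detailed
                group["graph"] = query_graph
            group["weight"] += 1
            group["measures"] |= set(query_graph["measure"])
            return
    groups.append({"graph": query_graph, "weight": 1,
                   "measures": set(query_graph["measure"])})

# In the example above, R1, then R2 (more detailed), then R3 (a sub-query)
# end up as a single group of weight 3 holding R2's graph.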

B. Exploitation of the Queries

From the previous step, we obtained a set of query groups, which is in fact a set of user needs. These needs are evaluated by a weight measuring the frequency, and hence the importance, of each need. Moreover, the graph of each group represents a star able to satisfy the corresponding needs. Now, two alternatives arise:

1. Translate the graph directly into a star, thus satisfying certain user needs;

2. Enrich the graph by adding all possible dimensions from the GCDM before translating it into a star.

Indeed, the graph of each group represents an expressed need, but one can choose to offer all the possibilities to the user, since these can represent future needs.

We propose to exploit the weight. If the weight is large, then the subject of analysis is critical for the company and it is more logical to offer all the possibilities of analysis. In contrast, if the weight is relatively small, then one can limit oneself to the graph of the group. For that, we define a threshold S for each group GR as follows: if weight ≥ S, we enrich the graph before translating it into a star; otherwise, we limit ourselves to the graph of the group.

1) Direct Translation

In this stage, we have a set of query groups. The graph of each query group will be translated into a star: the rectangular nodes are merged into a single fact node and the circular nodes are added as dimensions. With regard to measures, we propose the following definition:

Def 3: Dependent measures: measures sharing the same data, for example a measure m and a second measure m1 such that m1 = F(m), where F is an aggregation function. We then analyze the set of measures of each group. In the case of dependent measures, we only keep the raw measures: if such a measure belongs to a rectangular node, we add it to the corresponding fact; the others can be computed. For example, if one measure is "Duration" (the duration of a session) and a second is "sum(Duration)" (the time load), only "Duration" is kept. If we want to translate the graph of GR1 into a star, we have as measure sum(TakeEx.note*Course.coef)/sum(Course.coef). The raw measures are "TakeEx.note" and "Course.coef". Since "Course.coef" belongs to a dimension table, we do not add it; "TakeEx.note" belongs to a rectangular node, so it is added. The translation result is illustrated in Figure 8.

Fig. 8: Star obtained by the direct translation of group GR1.
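A sketch of this direct translation, including the dependent-measure filtering, is given below. The raw measures are recovered from the aggregate expressions by looking for qualified "table.column" tokens, which is a purely illustrative shortcut rather than a rule prescribed by the paper.

import re

def direct_translation(group):
    """Merge the group's rectangular nodes into one fact, keep circular nodes as
    dimensions, and keep only raw measures that belong to a rectangular node."""
    shapes = group["graph"]["nodes"]                      # node name -> 'rect' / 'circle'
    facts = sorted(n for n, s in shapes.items() if s == "rect")
    dims = sorted(n for n, s in shapes.items() if s == "circle")
    fact_names = {f.lower() for f in facts}
    raw_measures = set()
    for expr in group["measures"]:
        for table, column in re.findall(r"(\w+)\.(\w+)", expr):
            if table.lower() in fact_names:               # e.g. TakeEx.note is kept,
                raw_measures.add(f"{table}.{column}")     # Course.coef is left out
    return {"fact": "_".join(facts), "dimensions": dims,
            "measures": sorted(raw_measures)}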

2) Enrichment of the Graph

In this case, the need is frequently expressed and the subject of analysis seems interesting for the company, so we try to offer all the possibilities of analysis. We extract from the GCDM an associated subgraph: a subgraph involving all the rectangular nodes belonging to the graph representing the group of queries, together with all the circular nodes connected to these nodes. In the case of GR1, we extract a subgraph made up of the nodes "TakeEx", "PossessEx", "Registered" and "appartientgrp". This graph is represented in Figure 9.

Fig. 9: Group GR1 after enrichment.

In the same way, the rectangular nodes are merged into a single fact node and the circular nodes are added as dimensions. Since the joins exist, we are sure that for each fact instance there is an instance of each dimension; thus, the information needed to feed the star is available. The translation of this star into a schema is depicted in Figure 10. For the determination of measures and hierarchies, we follow the same procedures as described previously. The final step consists in adding these fact schemas to the multidimensional schema obtained during the first stage. Those obtained directly are added as they are. As for those obtained after enrichment, the information they contain covers that of the schemas obtained during the first step: for example, the schema of Figure 10 gathers the information of "PossessEx", "TakeEx", "Registered" and "appartientgrp", since its star is made up of the stars of the latter. Thus, it is more logical to remove these initial schemas. Let us recall that the schemas obtained directly represent specific needs and may not contain the whole information; indeed, some dimensions may be missing from them.
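The enrichment step and the threshold S introduced earlier can be sketched as follows; the value of S, the data structures and the restriction to directly connected circular nodes are assumptions of this sketch.

def choose_alternative(group, threshold):
    """Frequent need: offer every possible dimension; otherwise keep the group's graph."""
    return "enrich" if group["weight"] >= threshold else "direct"

def enrich(group, gcdm_shapes, gcdm_arcs):
    """Subgraph of the GCDM induced by the group's rectangular nodes and all the
    circular nodes connected to them."""
    facts = {n for n, s in group["graph"]["nodes"].items() if s == "rect"}
    dims = {x for (src, tgt, _) in gcdm_arcs if src in facts or tgt in facts
            for x in (src, tgt) if gcdm_shapes.get(x) == "circle"}
    return {"fact": "_".join(sorted(facts)),
            "dimensions": sorted(dims),
            "measures": sorted(group["measures"])}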

V. EVALUATION OF THE APPROACH

In order to evaluate our approach, we considered a CDM representing the education domain. Since we do not have sufficient space in this paper, we only present a part of the multidimensional schema obtained through the application of our approach on this CDM, and we compare this result with the one obtained through the PowerAMC software, which offers an option to derive a multidimensional schema from a CDM.

Fig. 10: Fact schema corresponding to the graph of group GR1 after enrichment.

In Table 2, we present a comparison between the results obtained through our approach and those obtained using PowerAMC.

TABLE 2. COMPARISON BETWEEN OUR APPROACH AND POWERAMC.

                      Our approach                        PowerAMC
Hierarchy             Normalized                          Not normalized
Measures of facts     All measures are numeric            Some fact measures are not numeric
Users satisfaction    All users' queries are satisfied    Not guaranteed

On the other hand, we developed a prototype to evaluate our approach. This prototype uses Java as the programming language, GraphViz for drawing the graphs and PowerAMC to visualize the resulting multidimensional schemas.

VI. CONCLUSION

In this paper, we presented an approach for the design of a data warehouse based on the data sources. It may be considered original since it exploits the history of the queries addressed to the operational production databases. We started with the derivation of a first multidimensional schema from the CDM representing these data sources. This first derivation is based on a graph obtained from the CDM, composed of two types of nodes: rectangular nodes corresponding to potential facts and circular nodes corresponding to dimensions. Each subgraph made up of at least one rectangular node can play the role of a star. However, generating all possible subgraphs (composed of one or more rectangular nodes) would produce a large number of facts, the majority of which would be useless for the user. This is why we choose to generate, at first, only elementary stars composed of a single rectangular node; the choice of larger subgraphs is then guided by the users' queries. This approach can be completely automated and guarantees that the data warehouse can be fed from the sources. Moreover, we are sure that the needs already expressed by the users, and which were satisfied by the operational databases, remain satisfied by the data warehouse.

ACKNOWLEDGMENT

We address our thanks to Pr. Zaher Mahjoub for his help.

REFERENCES

1. Bonifati, A., F. Cattaneo, S. Ceri, A. Fuggetta, and S. Paraboschi (2001). Designing data marts for data warehouses. ACM Trans. Softw. Eng. Methodol. 10(4), 452–483.

2. Cabibbo, L. and R. Torlone (1998). A logical approach to multidimensional databases. In 6th EDBT, pp. 183–197. Springer.

3. Giorgini, P., S. Rizzi, and M. Garzetti (2008). GRAnD: A goal-oriented approach to requirement analysis in data warehouses. Decision Support Systems 45(1), 4–21.

4. Golfarelli, M., D. Maio, and S. Rizzi (1998). Conceptual design of data warehouses from E/R schema. In Proceedings of the Hawaii International Conference on System Sciences 7, 334.

5. Golfarelli, M. and S. Rizzi (2009). A survey on temporal data warehousing. IJDWM 5(1), 1–17.

6. Husemann, B., J. Lechtenbörger, and G. Vossen (2000). Conceptual data warehouse design. In Proc. of the International Workshop on Design and Management of Data Warehouses (DMDW 2000), pp. 3–9.

7. Mazón, J.-N., J. Lechtenbörger, and J. Trujillo (2009). A survey on summarizability issues in multidimensional modeling. Data Knowl. Eng. 68(12), 1452–1469.

8. Mazón, J.-N. and J. Trujillo (2009). A hybrid model driven development framework for the multidimensional modeling of data warehouses. SIGMOD Record 38(2), 12–17.

9. Moody, D. L. and M. A. Kortink (2000). From enterprise models to dimensional models: A methodology for data warehouse and data mart design. In Proc. of the Int'l Workshop on Design and Management of Data Warehouses, pp. 5.1–5.12.

10. Phipps, C. and K. C. Davis (2002). Automating data warehouse conceptual schema design and evaluation. In DMDW, pp. 23–32.

11. Soussi, A. and F. Gargouri (2005). Génération et validation automatiques de schémas de magasins de données. In Tunisie: GEI'05.
