
Post on 16-Jan-2016

Page 1: Analysis of Additivity in OLAP Systems John Horner and Il-Yeol Song john.horner@drexel.edu College of Information Science & Technology Drexel University

Analysis of Additivity in OLAP Systems

John Horner and Il-Yeol Song
john.horner@drexel.edu

College of Information Science & Technology
Drexel University
Philadelphia, PA 19104 USA

Peter P. Chen
Department of Computer Science
Louisiana State University
Baton Rouge, LA 70803

Online Analytical Processing (OLAP) Systems

• Historical, integrated, relatively static data

• Orders of magnitude larger than transactional systems

• Used for strategic decision making

• Query outputs nearly always aggregated sets of base data

• Effective summarizability is of paramount concern

Structure

• Facts are measures of interest

• Dimensions are attributes used to identify, select, group, and aggregate measures of interest.

• Attributes that are used to aggregate measures are labeled classification attributes, and are typically conceptualized as hierarchies

Operations

• Roll-up increases the level of aggregation along one or more classification hierarchies

• Drill-down decreases the level of aggregation along one or more classification hierarchies

• Slice-Dice selects and projects the data

• Pivoting reorients the multi-dimensional data view to allow exchanging facts for dimensions symmetrically

• Merging performs a union of separate roll-up operations

Additivity

• The ability to use the aggregate summation operator to accurately summarize data is known as Additivity

• A measure is Additive along a dimension if the sum operator can be used to meaningfully aggregate values along all hierarchies in that dimension

• Fully-additive measures are additive across all dimensions

• Semi-additive measures are only additive across certain dimensions

• Non-additive measures are not additive across any dimension

Additivity Example

Rows: Date. Columns: Customer.

Date       100001  100002  100003  100004  TOTAL
1/1/2000      500     700    9890     600  ADDITIVE
2/1/2000      800     450   10050     200  …
3/1/2000      980     900    8700     800  …
4/1/2000      400     360    7800     750  …
…               …       …       …       …  …
TOTAL      NON-ADDITIVE  …  …  …  …

Classification

Classification                      Examples
1.0 Non-Additive
  1.1 Fractions
    1.1.1 Ratios                    GMROI, Profitability ratios
    1.1.2 Percentages               Profit margin percent, Return percentage
  1.2 Measurements of intensity     Temperature, Blood pressure
  1.3 Average/Maximum/Minimum
    1.3.1 Averages                  Grade point average, Temperature
    1.3.2 Maximums                  Temperature, Hourly hospital admissions, Electricity usage, Blood pressure
    1.3.3 Minimums                  Temperature, Hourly hospital admissions, Electricity usage, Blood pressure
  1.4 Measurements of direction     Wind direction, Cartographic bearings, Geometric angles
  1.5 Identification attributes
    1.5.1 Codes                     Zip code, ISBN, ISSN, Area code, Phone number, Barcode
    1.5.2 Sequence numbers          Surrogate key, Order number, Transaction number, Invoice number
2.0 Semi-Additive
  2.1 Dirty data                    Missing data, Duplicate data, Incorrect data
  2.2 Changing data                 Area codes, Department names, Customer address
  2.3 Temporally non-additive       Account balances, Quantity on hand, Quantity sold
  2.4 Categorically non-additive    Basket counts, Quantity on hand, Quantity sold

Non-Additive Measures

• Ratios and Percentages

• Measures of Intensity

• Average / Maximum / Minimum

• Measures of Direction
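
A minimal Python sketch of why ratios and percentages are non-additive. The products, profit figures, and revenue figures are made up for illustration and do not come from the slides. Summing per-row margins gives a number with no meaning, while aggregating the additive components first and then dividing gives a usable result.

```python
# Hypothetical rows: (product, profit, revenue). Values are illustrative only.
rows = [("A", 20.0, 100.0), ("B", 30.0, 300.0)]

# Per-row profit margin percent: a ratio, and therefore non-additive.
margins = [100.0 * profit / revenue for _, profit, revenue in rows]

# Summing the ratios produces a figure with no business meaning.
summed_margin = sum(margins)

# Summing the additive components first, then dividing, is meaningful.
total_profit = sum(profit for _, profit, _ in rows)
total_revenue = sum(revenue for _, _, revenue in rows)
overall_margin = 100.0 * total_profit / total_revenue

print(margins, summed_margin, overall_margin)
```

The same idea applies to averages: store the sum and the count, and derive the average after rolling up.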

Semi-Additive Facts

• Dirty Data

• Changing Data

• Temporally Non-Additive

• Categorically Non-Additive

• Not Mutually Exclusive

– e.g., measures can be both temporally and categorically non-additive

Causes of Dirty Data

• Summing measures associated with dirty data can result in inaccurate summaries if not all instances are counted, if instances are counted multiple times, or if instances are counted in the wrong group

[Figure: CustomerID range from 000001 to 999999, comparing actual customers with customers as stored in the database (values shown: 012454, 201454, 745654). Annotations: "Arbitrary Missing Data Value" and "Customer who pre-dates system".]

Rolling-up Dirty Data

[Figure: Transactions mapped onto a Classification Hierarchy, with three annotated anomalies]

– Anomaly will disappear when rolled up to the zip code level

– Anomaly will disappear when rolled up to the State level

– Anomaly will disappear when rolled up to the country level

• As measures are rolled up further along hierarchies, certain inaccurate values will be merged into the appropriate groups
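
The effect can be sketched in Python under stated assumptions: the transactions, zip codes, and amounts below are hypothetical, and one row is deliberately stored under the wrong zip code within the correct state. The zip-level summary is wrong, but the state-level summary is unaffected.

```python
from collections import defaultdict

# Hypothetical transactions: (true_zip, stored_zip, state, amount).
# The second row was stored under the wrong zip code, within the same state.
transactions = [
    ("19104", "19104", "PA", 100),
    ("19104", "19103", "PA", 50),   # dirty: zip code entered incorrectly
    ("70803", "70803", "LA", 75),
]

def roll_up(rows, key_index):
    """Sum the amount (last field) grouped by the field at key_index."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[-1]
    return dict(totals)

by_true_zip = roll_up(transactions, 0)    # what the summary should be
by_stored_zip = roll_up(transactions, 1)  # what the warehouse would report
by_state = roll_up(transactions, 2)       # the zip error no longer matters

print(by_true_zip, by_stored_zip, by_state)
```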

Hierarchy Completeness

• All instances belong to one higher level instance, which consists of those instances only

• Complete hierarchy (top): country consists of only the provinces listed

• Incomplete hierarchy (bottom): not all customers in the city are stored in the data warehouse; or not all customers in data warehouse have a city listed

[Figure: top panel "Complete" shows Country C1 with Provinces Pro1, Pro2, Pro3. Bottom panel "Incomplete" shows City with Customers Cust1, Cust2, Custn, plus Custx. Levels shown: Country, Province, City, Customer.]

Example of Additivity Problems Associated with Incomplete Hierarchies

CustID  City        SalesAmt
1       Washington  100
2       New York    200
999     Unknown     100
4       New York    150
5       Washington  150
6       Washington  150
999     Unknown     100
Total               950

• If Sales are rolled up to the city, but not all customers have a city stored in the database, then the summary will not accurately portray the sales grouped by city.

Summary

City        Sales
Washington  400
New York    350
Total       750
Unknown     200
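
The gap between the two totals can be reproduced with a short Python sketch over the rows of the table above. The variable names are illustrative, and treating the "Unknown" city as the only marker of missing data is an assumption.

```python
from collections import defaultdict

# Rows from the example table: (CustID, City, SalesAmt).
sales = [
    (1, "Washington", 100), (2, "New York", 200), (999, "Unknown", 100),
    (4, "New York", 150), (5, "Washington", 150), (6, "Washington", 150),
    (999, "Unknown", 100),
]

# Roll sales up to the city level.
by_city = defaultdict(int)
for _cust_id, city, amount in sales:
    by_city[city] += amount

grand_total = sum(by_city.values())
known_total = sum(v for city, v in by_city.items() if city != "Unknown")
unattributed = grand_total - known_total  # sales that no city summary includes

print(dict(by_city), grand_total, known_total, unattributed)
```

Reporting the unattributed amount next to the city totals is one way to show analysts how far the summary may be off.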

Changing Data

• It is important to track merges, splits, and overlapping hierarchies, especially those that affect classification hierarchies, as the characteristics of the data and environment change

Changing Data Example

Year  City          Area Code  Population
1990  Philadelphia  215        200
2000  Philadelphia  610        150
2000  Philadelphia  215        150
2000  Philadelphia  484        100

• Area code 215 was split into three area codes. The population trend for area code 215 alone would show a decrease, when in fact the population of the area originally covered by 215 has doubled.
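
A small Python sketch of the split, using the rows of the table above. The `original_code` lineage mapping is an assumption added for illustration: it records which 1990 area code covered the territory of each current code.

```python
# Rows from the example table: (year, city, area_code, population).
rows = [
    (1990, "Philadelphia", "215", 200),
    (2000, "Philadelphia", "610", 150),
    (2000, "Philadelphia", "215", 150),
    (2000, "Philadelphia", "484", 100),
]

# Assumed lineage table: current area code -> code that covered the area in 1990.
original_code = {"215": "215", "610": "215", "484": "215"}

def population(rows, year, code, use_lineage):
    """Total population for a year and area code, optionally via the lineage map."""
    total = 0
    for row_year, _city, area_code, pop in rows:
        key = original_code[area_code] if use_lineage else area_code
        if row_year == year and key == code:
            total += pop
    return total

naive_1990 = population(rows, 1990, "215", use_lineage=False)
naive_2000 = population(rows, 2000, "215", use_lineage=False)    # apparent decrease
tracked_2000 = population(rows, 2000, "215", use_lineage=True)   # same territory as 1990

print(naive_1990, naive_2000, tracked_2000)
```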

Temporally Non-Additive

• Measures that cannot be meaningfully added across different time periods are temporally non-additive

• Examples

– Account balances

– Quantity on hand

Temporally Non-Additive Example

Date       100001  100002  100003  100004  TOTAL
1/1/2000      500     700    9890     600  …
2/1/2000      800     450   10050     200  …
3/1/2000      980     900    8700     800  …
4/1/2000      400     360    7800     750  …
…               …       …       …       …  …
TOTAL      NON-ADDITIVE  …  …  …  …

Temporally Non-Additive SQL

-- Sums balance snapshots across dates for each customer (temporally non-additive)
SELECT CustomerID, SUM(balance)
FROM AccountFact
GROUP BY CustomerID;

-- Must group by time interval of snapshot
SELECT date, SUM(balance)
FROM AccountFact
GROUP BY date;
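
The two queries can be tried against an in-memory SQLite database. This is a sketch only: the table and column names follow the slide, the column types are an assumption, and the rows are the first two dates and customers from the earlier balance table.

```python
import sqlite3

# In-memory table named as on the slide; the column types are an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AccountFact (date TEXT, CustomerID TEXT, balance INTEGER)")
conn.executemany(
    "INSERT INTO AccountFact VALUES (?, ?, ?)",
    [
        ("1/1/2000", "100001", 500), ("1/1/2000", "100002", 700),
        ("2/1/2000", "100001", 800), ("2/1/2000", "100002", 450),
    ],
)

# Sums balance snapshots across dates for each customer: not a meaningful figure.
across_time = dict(conn.execute(
    "SELECT CustomerID, SUM(balance) FROM AccountFact GROUP BY CustomerID"
).fetchall())

# Sums balances across customers within each snapshot date: meaningful.
per_snapshot = dict(conn.execute(
    "SELECT date, SUM(balance) FROM AccountFact GROUP BY date"
).fetchall())

print(across_time)
print(per_snapshot)
```

Customer 100001 never held 1300; that figure only appears because two snapshots of the same account were added together.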

Categorically Non-Additive

• Measures that cannot meaningfully be summed across different types of items can be considered categorically non-additive

• Examples

– Basket counts

– Quantity on hand

Categorically Non-Additive Example

Date      Customer  Item ID  Product Name        …  Basket Count
1/1/2000  1         10001    X Brand Soup        …  5
1/1/2000  1         10002    Y Brand Soup        …  2
1/1/2000  2         12510    Z Brand Television  …  1
1/1/2000  3         10001    X Brand Soup        …  4
…         …         …        …                   …  …
TOTAL     …         …        …                   …  NON-ADDITIVE

Categorically Non-Additive SQL

-- Sums basket counts across different types of items (categorically non-additive)
SELECT SUM(BasketCount)
FROM SalesFact;

-- Must group by attribute in product family hierarchy
SELECT ProductName, SUM(BasketCount)
FROM SalesFact
GROUP BY ProductName;
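
As with the temporal case, both queries can be run against an in-memory SQLite database loaded with the four rows of the example table. The column names and types in `SalesFact` are assumptions beyond `BasketCount` and `ProductName`, which come from the slide.

```python
import sqlite3

# In-memory table named as on the slide; schema details are an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SalesFact (date TEXT, Customer INTEGER, ItemID INTEGER, "
    "ProductName TEXT, BasketCount INTEGER)"
)
conn.executemany(
    "INSERT INTO SalesFact VALUES (?, ?, ?, ?, ?)",
    [
        ("1/1/2000", 1, 10001, "X Brand Soup", 5),
        ("1/1/2000", 1, 10002, "Y Brand Soup", 2),
        ("1/1/2000", 2, 12510, "Z Brand Television", 1),
        ("1/1/2000", 3, 10001, "X Brand Soup", 4),
    ],
)

# Adds soup cans and televisions into one number: not meaningful.
mixed_total = conn.execute("SELECT SUM(BasketCount) FROM SalesFact").fetchone()[0]

# Grouped by a product attribute, each sum covers one kind of item.
per_product = dict(conn.execute(
    "SELECT ProductName, SUM(BasketCount) FROM SalesFact GROUP BY ProductName"
).fetchall())

print(mixed_total, per_product)
```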

Others’ Suggestions

• The distinction between meaningful and meaningless aggregation data should be stored in an appendix

» Hüsemann et al. (2000)

• Data should be normalized into a General Multidimensional Normal Form (GMNF), whereby aggregation anomalies are avoided through a conceptual modeling approach that emphasizes sorting out dimensions, dimensional hierarchies, and which measures belong where.

» Hüsemann et al. (2000)

• Conceptual models should explicitly depict hierarchies and aggregation constraints along hierarchies, and a fact glossary should be developed describing how each fact was derived from an ER model

» Golfarelli and Rizzi (1998)

• We need to rigorously classify hierarchies and detailed characteristics of hierarchies, such as completeness and multiplicity

» Pourabbas and Rafanelli (1999)

• Slowly Changing Dimensions (Kimball and Ross, 2002)

– Type 1: simply overwriting data

– Type 2: storing the new data instance in a new row, but with a common field to link the dimensions as being the same

– Type 3: adding a new attribute to the dimension table to store both the new and old values

Our Suggestions

• No simple solution

– Can’t always eliminate potential inaccuracies

– Categorically Non-additive data

– Glossaries may be ignored

– Conceptual models may be overly complex

– This doesn’t mean that we shouldn’t have glossaries and include constraints in conceptual models

• Online Summarizability Constraints

– Imagine abundance of update anomalies in transactional systems if possible violations are only stored in glossaries or conceptual models

• Where measures are imprecise, queries should show error bounds
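
One way an online summarizability constraint might look is sketched below in Python. The slides do not specify a mechanism, so the `MEASURES` metadata layout and the `sum_violations` helper are hypothetical, and the additivity assigned to each measure is an assumption based on the earlier examples.

```python
# Hypothetical metadata: for each measure, the dimensions of its fact table and
# the dimensions along which SUM is assumed to be meaningful.
MEASURES = {
    "balance": {
        "dimensions": {"date", "customer"},
        "additive_along": {"customer"},            # temporally non-additive
    },
    "BasketCount": {
        "dimensions": {"date", "customer", "product"},
        "additive_along": {"date", "customer"},    # categorically non-additive
    },
}

def sum_violations(measure, group_by):
    """Return the dimensions a SUM would collapse although the measure
    is not additive along them (empty list means no violation)."""
    meta = MEASURES[measure]
    collapsed = meta["dimensions"] - set(group_by)
    return sorted(collapsed - meta["additive_along"])

print(sum_violations("balance", ["customer"]))    # collapses date
print(sum_violations("balance", ["date"]))        # no violation
print(sum_violations("BasketCount", []))          # collapses product
```

A query layer could call such a check before running an aggregate and warn the analyst, instead of relying on a glossary being read.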

Hierarchies

• Strict - each object at a lower level belongs to only one value at a higher level

• Non-strict - can be thought of as a many-to-many relationship between a higher level of the hierarchy and the lower level

• Complete - all members belong to one higher-class object, which consists of those members only

• Incomplete - not complete

• Multiple path - lower object splits into two distinct higher level objects

• Alternate path - multiple path hierarchy that joins again at a higher level
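
The strictness and completeness definitions can be checked mechanically. The Python sketch below assumes a hierarchy level is given as a list of (child, parent) links; the function names are illustrative, and the completeness check covers only one half of the definition (every member has a parent).

```python
from collections import defaultdict

def parents_by_child(links):
    """Map each child to the set of parents it is linked to."""
    parents = defaultdict(set)
    for child, parent in links:
        parents[child].add(parent)
    return parents

def is_strict(links):
    """Strict: every linked child belongs to exactly one parent."""
    return all(len(p) == 1 for p in parents_by_child(links).values())

def members_without_parent(members, links):
    """Members with no parent at all, which make the hierarchy incomplete."""
    linked = parents_by_child(links)
    return sorted(m for m in members if m not in linked)

# Hypothetical links: project 3 belongs to two departments.
project_dept = [(1, 1), (2, 1), (3, 1), (3, 2), (4, 2), (5, 2)]
# Hypothetical links: each person belongs to one department.
person_dept = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 2), (6, 2)]
# Hypothetical customers, one of which has no city recorded.
customers = ["Cust1", "Cust2", "Custx"]
customer_city = [("Cust1", "City"), ("Cust2", "City")]

print(is_strict(project_dept), is_strict(person_dept))
print(members_without_parent(customers, customer_city))
```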

Hierarchy Strictness

• In strict hierarchies, lower level instances in hierarchy belong to only one higher level instance

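[Figure: two hierarchies, with panels labeled "Strict" and "Non-Strict". One shows Departments D1, D2 over Projects Pr1 to Pr5 (levels: Department, Project); the other shows Departments D1, D2 over Persons P1 to P5 (levels: Department, Person).]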

Example of Additivity Problems Associated with Non-Strict Hierarchies

Project  Dollars
1         10000
2         15000
3        120000
4         50000
5         30000
Total    225000

Denormalized Fact Table

Dept  Project  Dollars
1     1         10000
1     2         15000
1     3        120000
2     3        120000
2     4         50000
2     5         30000
Total          345000
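
The two totals can be reproduced in a few lines of Python from the rows of the denormalized table. Deduplicating by project before summing is one possible correction, shown here as an assumption and not as the authors' method.

```python
# Rows from the denormalized fact table: (dept, project, dollars).
fact = [
    (1, 1, 10000), (1, 2, 15000), (1, 3, 120000),
    (2, 3, 120000), (2, 4, 50000), (2, 5, 30000),
]

# Naive sum over the denormalized rows counts project 3 twice.
naive_total = sum(dollars for _, _, dollars in fact)

# Keep one dollar figure per project before summing.
dollars_by_project = {project: dollars for _, project, dollars in fact}
true_total = sum(dollars_by_project.values())

# Department subtotals are each valid, but they do not add up to the true total.
by_dept = {}
for dept, _, dollars in fact:
    by_dept[dept] = by_dept.get(dept, 0) + dollars

print(naive_total, true_total, by_dept)
```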

Alternate and Multiple Path Hierarchies

• Inaccurate summaries can result from merging aggregates from multiple paths of a hierarchy.

[Figure a. Alternate Path Classification Hierarchy, with levels Store, City, County, State, Country, ZipCode, AreaCode]

[Figure b. Multiple Path Classification Hierarchy, with levels Date, Month, Quarter, Year, Week, DayOfWeek]

Example of Problems Associated with Merging Multiple Path Hierarchies

Person  Dept  Project  Hours
1       1     1         40
2       1     2        100
3       2     2         50
4       2     2         50
5       2     2         40
6       2     2         80

• Adding Hours from all the people in Department 1 with all the people who worked on Project 2 results in an inaccurate summary because Person 2 is counted twice.

• The summary would be accurate if each project mapped directly to one department

[Figure: Multiple Path Hierarchy in which Person rolls up to both Project and Department. Department 1: 140 hrs. Project 2: 320 hrs. Merged: 460 hrs; should be 360 hrs.]
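
The figures above can be checked with a short Python sketch over the rows of the table. Summing each person once over the union of the two groups is shown as one possible correction; it is an illustration, not a method taken from the slides.

```python
# Rows from the example table: (person, dept, project, hours).
rows = [
    (1, 1, 1, 40), (2, 1, 2, 100), (3, 2, 2, 50),
    (4, 2, 2, 50), (5, 2, 2, 40), (6, 2, 2, 80),
]

# Two separate roll-ups along different paths of the hierarchy.
dept1_hours = sum(h for _, dept, _, h in rows if dept == 1)
proj2_hours = sum(h for _, _, proj, h in rows if proj == 2)

# Merging the two aggregates counts Person 2 in both.
merged_naive = dept1_hours + proj2_hours

# Summing each person once over the union of the two groups.
merged_union = sum(h for _, dept, proj, h in rows if dept == 1 or proj == 2)

print(dept1_hours, proj2_hours, merged_naive, merged_union)
```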

Our Suggestions (Cont.)

Conclusions

• Recognizing whether measures are fully-, semi-, or non-additive is essential to identifying and resolving potential inaccurate summaries in OLAP systems

• Non-additive measures cannot be aggregated using the sum operator

• Semi-additive measures can sometimes be aggregated using the sum operator, but at other times cannot

• Therefore, semi-additive attributes pose the highest risk for unrecognized inaccurate summaries

• There are several reasons why data could be semi-additive

– Adding different types of items together

– Adding measures multiple times in the same summary

– Not including all instances when aggregating measures

– Including measures in the wrong groups

• Metadata could be used to alert analysts to potentially inaccurate queries

References

• Golfarelli, M., Maio, D., and Rizzi, S. (1998). Conceptual Design of Data Warehouses from E/R Schemes. Proceedings of the Thirty-First Hawaii International Conference, 6-9 Jan. 1998, 7, 334–343.

• Hüsemann, B., Lechtenbörger, J., and Vossen, G. (2000). Conceptual data warehouse design. Proc. International Workshop on Design and Management of Data Warehouses, 2000.

• Kimball, R. and Ross, M. (2002). The Data Warehouse Toolkit: Second Edition. John Wiley and Sons, Inc.

• Pourabbas, E. and Rafanelli, M. (1999). Characterizations of hierarchies and some operators in OLAP environments. Proceedings of the 2nd ACM international workshop on Data warehousing and OLAP. Kansas City, Missouri. 54–59.

• Shoshani, A. (1997). OLAP and statistical databases: Similarities and differences. Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. Tucson, Arizona. 185–196. ACM Press, New York, NY.