summarizing data with cube and rollup

Summarizing Data with CUBE and ROLLUP

SQL ROLLUP and CUBE commands

• Quick, efficient way to summarize the data stored in your database

• Offer a valuable tool for gaining some quick and dirty insight into your data

• ROLLUP and CUBE are SQL extensions and they're available in MySQL, SQL Server 6.5 (and above) and Oracle 8i (and above)

CUBE OperationPet Table

Type Store Number

Dog Miami 12

Cat Miami 18

Turtle Tampa 4

Dog Tampa 14

Cat Naples 9

Dog Naples 5

Turtle Naples 1

SQL: select Type, Store, sum(Number)from oPetgroup by type, storeWITH CUBE

Analyzing the data, you'll notice that our chain has 27 cats, 31 dogs and 5 turtles spread among our three stores. Our Miami store has the largest number of pets in stock with a whopping inventory of 30 pets.

Type Store Number

Cat Miami 18

Cat Naples 9

Cat NULL 27

Dog Miami 12

Dog Naples 5

Dog Tample 14

Dog NULL 31

Turtle Naples 1

Turtle Tample 4

Turtle NULL 5

NULL Miami 30

NULL Naples 15

NULL Tampa 18

NULL NULL 63

ROLLUP OperationPet Table

Type Store Number

Dog Miami 12

Cat Miami 18

Turtle Tampa 4

Dog Tampa 14

Cat Naples 9

Dog Naples 5

Turtle Naples 1

SQL: select Type, Store, sum(Number)from Petgroup by type, storeWITH ROLLUP

Type Store Number

Cat Miami 18

Cat Naples 9

Cat NULL 27

Dog Miami 12

Dog Naples 5

Dog Tample 14

Dog NULL 31

Turtle Naples 1

Turtle Tample 4

Turtle NULL 5

NULL NULL 63

We're not particularly interested in the total number of pets at each store -- we'd just like to know our statewide inventory of each species along with

the standard GROUP BY data.

NULL Miami 30

NULL Naples 15

NULL Tampa 18

Utilizing the ROLLUP operator instead of the CUBE operator will eliminate the results that

contain a NULL in the first column.

ท่ีมา: http://databases.about.com/od/sql/l/aacuberollup.htm

• Show ‘how to use Excel to do CUBE and ROLLUP operations’

Aggregations

• It simply means a summarized, typically additive value.

• The level of aggregation in a star schema depends on the scenario.

• Many star schemas are aggregated to some base level, called the grain, although this is becoming somewhat less common as developers rely on cube building engines to summarize to a base level of granularity.

Cartesian-Product Operation – Example

Relations r, s:

r x s:

A=C(r x s):

Relations r, s:

r x s:

A=C(r x s):

Natural Join – Example

r s

Same domainJoin operation is very slow.

Natural Join – Example• Relations r, s:

r s

Database Design?• Is there any problem with this design?

departmentinstructor

Normalization

Comparison

Normalization+ Reduce space+ Small table size+ Reduce update, insert, and delete

anomaly+ Reduce data entry+ Use less index+ Fast index since we use numeric to

represent index+ Good for update, insert, and delete- Increase number of joins in

multiple tables

Denormalization+ Fast query: query fewer tables+ Reduce number of joins (fast

query processing time)+ Good for data retrieval+ Easy for end user to

understand data- Increase Redundancy data- Big table size- Increase update, insert, and

delete anomaly

Which table is much easier to understand?

Creating a view that looks like this, however,

still uses joins in the background and therefore

does not achieve the best performance on

the query.

Designing the Star Schema Database

Fortunately, there is a better way.

Motivation

• Creating a Star Schema Database is one of the most important step in creating a data warehouse.

• It is important to understand how to move from a standard, on-line transaction processing (OLTP) system to a final star schema.

For example: •Assume a company sells only two products: dog food and cat food. •Each day, the company records the sales of each product. •At the end of a couple of days, the data looks like this:

• Each day contains several transactions. •This is the data as stored in a standard OLTP system. • However, the data warehouse might not record this level of detail. Instead, it could summarize, or aggregate, the data to daily totals. •The records in the data warehouse might look something like this:

Date

Quality Sold

Dog Food Cat Food

4/24/2012 15 13

4/25/2012 9 8

Pet Table

Date Order NumberQuantity Sold

Dog Food Cat Food

4/24/2012 1 5 2

2 3 0 3 2 6 4 2 2 5 3 3

4/25/2012 1 3 7

2 2 1 3 4 0

•This summarization of data reduces the number of records by aggregating the individual transaction records into daily records that show the number of each product purchased each day.

•In this simple example, it is easy to derive the table simply by running a query against Pet table. select date, sum(Dog Food) as Dog Food, sum(Cat Food) as Cat Food from pet group by date

•However, many complexities enter the picture that will be discussed later.

Date

Quality Sold

Dog Food Cat Food

4/24/2012 15 13

4/25/2012 9 8

Date Order Number Dog Food Cat Food

4/24/2012 1 5 2

4/24/2012 2 3 0

4/24/2012 3 2 6

4/24/2012 4 2 2

4/24/2012 5 3 3

4/25/2012 1 3 7

4/25/2012 2 2 1

4/25/2012 3 4 0

OLTP Systems

• Online Transaction Processing, systems are standard, normalized databases.

• OLTP systems are optimized for inserts, updates, and deletes; in other words, transactions.

• Transactions in this context can be thought of as the entry, update, or deletion of a record or set of records.

• They minimize repeated data, and they limit the number of indexes

Flat Structure in Excel

Problem: Repeat the header information in each detail record.

Normalization• Database normalization is the process of removing repeated information (field) into a separate table. • Normalization first removes repeated records in a table. For example, the following order table contains much repeated information and is not recommended:

If there were twenty repeated sets of fields for detail records, the table would be unable to handle an order for twenty one or more products. If an order has just has one product ordered, all the other fields are useless.

The first step in the normalization process is to break the repeated fields into a separate table, and end up with this:

Now, an order can have any number of detail records.

Reduce the data entry requiredReduce the size of an OrderDetail record

ER-Diagram

OrderID LineItem ProductID QuantityOrdered

12345 1 A13J2 200

12345 2 J44RR 300

12345 3 GA1AE 1000

To reduce repeated data, the new detail records might look like this:

OLTP schemas store as little data as possible to speed inserts, updates, and deletes. Therefore, any number that can be calculated at query time is calculated and not stored.

OLTP Advantages• OLTP allows for the minimization of data entry

– Only the primary key value from the OrderHeader table, the primary key of the Product table, and then the order quantity is stored

– Built for data inserts, updates, and deletes, but not data retrieval• OLTP reduces the size of an OrderDetail record.

– help speed inserts, updates, and deletes.• Most of the fields that link to other tables are numeric.

– Queries generally perform much better against numeric fields than they do against text fields.

– Replacing a series of text fields with a numeric field can help speed queries.– Numeric fields also index faster and more efficiently.

• With normalization, there are frequently fewer indexes per table. – With fewer indexes to maintain, inserts, updates, and deletes run faster.

OLTP Disadvantages

• With a normalized structure, queries must utilize joins across multiple tables to get all the data.

• Joins tend to be slower than reading from a single table, so minimizing the number of tables in a query will boost performance.

• The data in an OLTP system is not user friendly.– Most customers don't know how to make sense of the

normalized structure of the database. Hard to query– Joins are somewhat mysterious, and complex table

structures are difficult for the average customer to use.

OLTP Disadvantages• The fewer indexes in the database, the faster inserts, updates, and deletes

will be. • The fewer indexes in the database, the slower select queries will run. • For the purposes of data retrieval, a higher number of correct indexes helps

speed retrieval. • Since one of the design goals to speed transactions is to minimize the

number of indexes, OLTP databases trade faster transactions at the cost of slowing data retrieval.

• This is one reason for creating two separate database structures: – an OLTP system for transactions– an OLAP system for data retrieval.

OLTP Systems

• OLTP allows for the minimization of data entry• Developers minimize the number of indexes in an

OLTP system. • Indexes are important but they slow down inserts,

updates, and deletes. • Therefore, most schemas have just enough indexes

to support lookups and other necessary queries. • Over-indexing can significantly decrease

performance.

Reason to Denormalize• If customers want to spend the majority of their time performing analysis

by looking at the data, the IT group should support their desire for fast, easy queries.

• On the other hand, maintaining the speed requirements of the transaction-processing activities is critical.

• If these two requirements seem to be in conflict, they are, at least partially. • Many companies have solved this by having a second copy of the data in a

structure reserved for analysis. • This copy is more heavily indexed, and it allows customers to perform large

queries against the data without impacting the inserts, updates, and deletes on the main data.

• This copy of the data is often not just more heavily indexed, but also denormalized to make it easier for customers to understand.

Reason to Denormalize

• Users query a lot. Therefore– Need to build many index to speed up query– Need to denormalize table to reduce join

operation and remove users’ confusion of tables’ structure

How Humans View Information

• How many bottles of Aniseed Syrup were sold last week?

• Are overall sales of Condiments up or down this year compared to previous years?

• On a quarterly and then monthly basis, are Dairy Product sales cyclical?

• In what regions are sales down this year compared to the same period last year? What products in those regions account for the greatest percentage of the decrease?

How Humans View Information

• Tracking measure• Aggregation data (sum, count)• Time involved• By certain condition

Fact Table

Dimension Table

Star Schema

• What people want to see (Event, Measures)– Fact Table (e.g., Sales)

• How they want to see it– Dimension Table

View data by time, by location, by product

Dimension Table

• All dimension tables should have a single-field primary key.

• This key is typically a surrogate key and is often just an identity column, consisting of an automatically incrementing number.

• The value of the primary key is meaningless, hence the surrogate key.

• The real information is stored in the other fields. – These fields do not contain codes that link to other tables.

Because the fields contain full descriptions, the dimension tables are often short and fat; they contain many large fields.

fat , short table

Dimension Table• Short and fat.• Cannot compare in size to a normal fact table. • For example,

– Assume a company has 30,000 products in the product table.– The company may track sales for these products each day for

several years. – Assuming the company actually only sells 3,000 products in any

given day.– If they track these sales each day for ten years, the fact table has

• 3,000 products sold X 365 day/year * 10 years = 11,000,000 records • The dimension table with 30,000 records will be short compared to the

fact table.

fat , short table

Fact Table• Keep fact or measure data by conditions (long)

– Measures are numeric and additive across some or all of the dimension tables– For example, sales are numeric and users can look at total sales for a product, or

category, or subcategory, and by any time period. The sales figures are valid no matter how the data is sliced.

• Use Primary Key of each Dimension table (skinny)

skinny , long table

Fact TableLong Table - hold the number of records represented by the product of the counts in all the dimension tables. Therefore, there are many many records.

Skinny table - hold the fields which are the primary key (foreign keys ) of the dimension tables. These fields are typically integer values. In addition, the measures are also numeric. Therefore, the size of each record is generally much narrower than those in the dimension tables.

skinny , long table

Fact Table

• 10 years of daily data• 200 stores• 500 products

Fact Table size: 365,000,000 records (3650 days * 200 stores * 500 products)

• This large number of records makes the fact table long.

• If add another dimension, such as a dimension of 10,000 customers, the record of the fact table can be increased up to 10,000 times.

skinny , long table

Query1: the total sales for a particular year and month for a particular category the total SalesDollars for March, 1999 for category = “Brass Goods”

SQL query:SELECT Sum(SF.SalesDollars) AS SumOfSalesDollarsFROM TimeDimension T , ProductDimension P, SalesFact SFWHERE T.TimeID = SF.TimeID AND P.ProductID = SF.ProductID AND P.Category='Brass Goods' AND T.Month=3 AND T.Year=1999

To drill down to a subcategory, the SQL would change to look like this:

SELECT Sum(SF.SalesDollars) AS SumOfSalesDollarsFROM TimeDimension T, ProductDimension P, SalesFact SF WHERE P.ProductID = SF.ProductID AND T.TimeID = SF.TimeID AND P.SubCategory='Widgets' AND T.Month=3 AND T.Year=1999;

Query2: The total SalesDollars for March, 1999 by category

SQL query:SELECT P.Category AS Category, Sum(SF.SalesDollars) AS SumOfSalesDollarsFROM TimeDimension T , ProductDimension P, SalesFact SFWHERE T.TimeID = SF.TimeID AND P.ProductID = SF.ProductID AND T.Month=3 AND T.Year=1999GROUP BY P.Category

Query3: The total SalesDollars by month and by brand

SQL query:SELECT P.Month AS Month, P. Brand AS Brand, Sum(SF.SalesDollars) AS SumOfSalesDollarsFROM TimeDimension T , ProductDimension P, SalesFact SFWHERE T.TimeID = SF.TimeID AND P.ProductID = SF.ProductID GROUP BY T.Month, P. Brand

Snowflake SchemasA Dimension tables is fat so we can broken out it into separate tables. Save space.This is a more normalized structure, but leads to more difficult queries and slower response times.

Difficult to drill-down into data

Snowflake Schemas

• Increases the number of joins and can slow queries• The purpose of an OLAP system is to speed queries,

snowflaking is usually not productive. • Some people try to normalize the dimension tables to

save space.• However, in the overall scheme of the data warehouse,

the dimension tables usually only account for about 1% of the total storage.

• Any space savings from normalizing, or snowflaking, are negligible.

Fact Granularity

• The granularity, or frequency, of the data is determined by the lowest level of granularity of each dimension table.

• For example, a fact table may store weekly or monthly totals for individual products.

• The lower the granularity, the more records that will exist in the fact table.

• The granularity also determines how far users can drill down without returning to the base, transaction-level data.

Fact Table Size

• 500 products sold in 200 stores and tracked for 10 years – could produce 365,000,000 records in a fact table with

a daily grain (maximum size for the table)– Star schemas do not store zero values unless zero has

some significance. – So, if a product did not sell at a particular store for a

particular day, the system would not store a zero value. – The fact table contains only the records that have a

value.

500 x 200 x 10 x 365

Fact Table Size• The lower the granularity, the larger the fact table. • In the previous example, moving from a daily to weekly grain

would reduce the potential number of records to only slightly more than 52,000,000 records (500 products x 200 stores x 10 years x 52 weeks)

• The data types for the fields in the fact table are numeric, which can require less storage space than the long descriptions we find in the dimension tables.

• Be aware that each added dimension can greatly increase the size of the fact table. – For example, adding customer dimension table that has 20 possible

values (record), the potential number of records would reach 7.3 billion.

500 x 200 x 10 x 365 x 20 = 7,300,000,000 records

Reference

• Designing the Star Schema Databasehttp://

www.ciobriefings.com/whitepapers/StarSchema.asp

http://databases.about.com/od/sql/l/aacuberollup.htm

http://www.ciobriefings.com/whitepapers/StarSchema.asp





summarizing data with cube and rollup

Documents

cr x s

data entry

summarizing data

cube operator

data warehouse

cube commandsquick

data retrieval easy

index fast index