acctg 6910 building enterprise & business intelligence systems (e.bis)

Post on 02-Jan-2016

19 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis). Physical Data Warehouse Design. Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential Chair of Business. It’s all about trading storage for speed!. Fundamentals Aggregates (Ch. 16, pp. 356 - 357) - PowerPoint PPT Presentation

TRANSCRIPT

1

ACCTG 6910Building Enterprise &

Business Intelligence Systems(e.bis)

ACCTG 6910Building Enterprise &

Business Intelligence Systems(e.bis)

Physical Data Warehouse Design

Olivia R. Liu Sheng, Ph.D.Emma Eccles Jones Presidential Chair of Business

Olivia R. Liu Sheng, Ph.D.Emma Eccles Jones Presidential Chair of Business

2

It’s all about trading storage for speed!

• Fundamentals• Aggregates (Ch. 16, pp. 356 - 357)• Indexes (Ch. 16, p. 357)

3

Fundamentals: the Storage Hierarchy

CPU

Cache

Memory

Disk

Storage Capacity

Small

Large

Access Speed

Slow

Fast

10-8 second

10-7 second

10-2 second

500-1000 MIPS

512 KB

512 MB

512 GB

4

Fundamentals: the Storage Hierarchy

CPU

Memory Disk

Disk Drive (I/O Channel)

Cache

Bus

How long does it take to query sales by city?How large is the Sales Fact table?How long does it take to access the Sales Fact table?

5

FundamentalsHow large is the fact table?e.g., 1 million records/day, 0.2KB/record 0.2

GB/day

SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES

CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY

PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME

TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER

reference

referenced by

reference

referenced by

reference

referenced by

6

Fundamentals

How long does it take to access all the fact records?

E.g., the small fact table is 1 Terabyte in size!

– 0.01s*1012=325 years LONG!!!!!!!!!!!!!

7

Fundamentals: the Storage Hierarchy

CPU

Memory Disk

Disk Drive (I/O Channel)

Cache

Bus

The logical unit of data transferred between disk and memory is block (e.g., 4k bytes)

8

Fundamentals

How long does it take to access all the fact records?

E.g., the small fact table is 1 Terabyte in size!

– Number of blocks: 2.5 millions– Access time = 0.01s*2500000= < 7

hrs!!!

9

Aggregate

• In data warehouse design, we choose the gain of fact table to be the possible lowest level.

SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES

CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY

PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME

TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER

reference

referenced by

reference

referenced by

reference

referenced by

Grain: orderline

10

Aggregate

• The reasons to choose the lowest level of fact: – (X) Analysts want to query on single

record

– (O) Analysts want to flexibly cut and group records.

11

Aggregate

• However, keeping the most detailed fact records could result in

– huge-size fact table: TeraBytes?! (1 million records/day, 256 Bytes/record

-> 0.2 GB/day)

– slow query

12

Aggregate

• To keep s data warehouse flexible, fact tables need to store facts in their lowest levels of detail.

• To improve query performance, another type of fact table which stores pre-computed summaries of detailed facts helps.

• Reduced to a logical design solution

13

Aggregate

• An aggregate fact table is a fact table that summarizes base-level fact table records

along one or several dimensions.• An aggregate dimension table is a

dimension table that summarizes base-level dimension table records.

• E.g., marketing managers check daily product sales by city --- aggregate by city in customer dimension

14

Aggregate

CUST_CITY# CITY_KEY* CITY* STATE

SALES_BY_CITY# TIME_KEY# PRODUCT_KEY# CITY_KEY* AVERAGE_PRICE_BY_CITY* TOTAL_QUANTITY_BY_CITY* TOTAL_SALES_BY_CITY

PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME

TIME# TIME_KEY* ORDERDATE* DAY_NUMBER_IN_MONTH

ref

ref

ref

ref

ref

refAggregate fact table

Aggregate dimension table

15

Aggregate

SALES_BY_CITY# TIME_KEY# PRODUCT_KEY# CITY_KEY* AVERAGE_PRICE_BY_CITY* TOTAL_QUANTITY_BY_CITY* TOTAL_SALES_BY_CITY

CUST_CITY# CITY_KEY* CITY* STATE

SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES

PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME

CUSTOMER# CUSTOMER_KEY* CID* CNAME* CITY* STATE

TIME# TIME_KEY* ORDERDATE* DAY_NUMBER_IN_MONTH

ref

ref

ref

ref

ref

ref

ref

ref

refref

ref

ref

16

Indexes

SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES

CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY

PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME

TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER

reference

referenced by

reference

referenced by

reference

referenced by

How long does it take to

find out the total purchase

Amt by Tom Jones?

17

Indexes

• Customer table– 1M records, each record 0.200 Kbytes

long– Block is 4K size, block access time is 0.01s– Number of records/block: 4/0.2=20 – Number of blocks: 1M/20=50K

• Sequential search– Time: 25K*0.01s=250s=4min.

18

Indexes

• Binary search– Time: log(50K)*0.01s=16*0.01s=0.16s

• B+ tree index– Create index pn on customer(cname);– If each node (block) in B+ tree has 117 keys, then

• # of access to indexes: log117(1M)=3 (i.e.height of the tree)

• # of access to Customer Dimension: 1• Total time = 4*0.01 = 0.04s

19

...(11 key values, 12 pointers)

...

B+-trees - P=12

Indexes to customer records

……….

Indexes to indexes

20

Indexes

SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES

CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY

PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME

TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER

reference

referenced by

reference

referenced by

reference

referenced by

How long does it take to

find out the total sales of

Desktop computers?

21

Performance Improvement

• Suppose there are only 4 product categories for 1M products

• Create a B+ tree index???– Suppose the size of product category

and block ID is 10 bytes– Size of index = 1M * 10 = 10 M bytes

22

Performance Improvement

• A bitmap index for an attribute A is a collection of bit vectors, one for each possible value of A. The vector for value v has 1 in position i if the ith record has v for attribute A.

23

Bitmaps

Desktop 1 0 1

Notebook 0 1 0

Server 0 0 0

Accessory

0 0 0

Product record 1 record 2 record 3

A bitmap index for an attribute A is a collection of bit vectors, one for each possible value of A. The vector for value v has 1 in position i if the ith record has v for attribute A.

24

Performance Improvement

• Bitmap index is suitable for low cardinality attribute.– Cardinality(A) = # of possible values for A/#of records

• Compared with B+ tree index, bitmap index has the following advantages for low cardinality attributes– Storage space saving (1M*4/8=500K bytes)– Efficient for boolean operations

• CREATE BITMAP INDEX bitpc ON PRODUCT (PCNAME);

top related