ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis)

Physical Data Warehouse Design

Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business

It’s all about trading storage for speed!

• Fundamentals
• Aggregates (Ch. 16, pp. 356 - 357)
• Indexes (Ch. 16, p. 357)

Fundamentals: the Storage Hierarchy

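Moving from the CPU out to disk, storage capacity grows from small to large while access speed drops from fast to slow.

• CPU: 500-1000 MIPS
• Cache: 512 KB, access time 10^-8 second
• Memory: 512 MB, access time 10^-7 second
• Disk: 512 GB, access time 10^-2 second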

Fundamentals: the Storage Hierarchy

[Figure: CPU, cache, memory, and disk connected by a bus; the disk is reached through the disk drive (I/O channel).]

• How long does it take to query sales by city?
• How large is the Sales Fact table?
• How long does it take to access the Sales Fact table?

Fundamentals

How large is the fact table?

E.g., 1 million records/day at 0.2 KB/record -> 0.2 GB/day

Star schema (# marks key columns, * marks other attributes):

SALES: # TIME_KEY, # PRODUCT_KEY, # CUSTOMER_KEY, * PRICE, * QUANTITY, * SALES
CUSTOMER: # CUSTOMER_KEY, * CID, * CNAME, * STATE, * CITY
PRODUCT: # PRODUCT_KEY, * PID, * PNAME, * PCNAME
TIME: # TIME_KEY, * ORDERDATE, * DAY_OF_WEEK, * DAY_NUMBER_IN_MONTH, * DAY_NUMBER_IN_YEAR, * WEEK_NUMBER, * MONTH, * QUARTER, * HOLIDAY_FLAG, * FISCAL_YEAR, * FISCAL_QUARTER

SALES references CUSTOMER, PRODUCT, and TIME; each dimension table is referenced by SALES.
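The sizing arithmetic above can be sketched in a few lines. This is only an illustration under the slide's figures (1 million records/day, 0.2 KB/record); the function names and the use of decimal units are assumptions made for the example.

```python
# Figures from the slide; decimal units (1 GB = 1,000,000 KB) are an assumption.
RECORDS_PER_DAY = 1_000_000
RECORD_SIZE_KB = 0.2
KB_PER_GB = 1_000_000
GB_PER_TB = 1_000


def daily_growth_gb(records_per_day=RECORDS_PER_DAY, record_size_kb=RECORD_SIZE_KB):
    """Return how many GB the fact table grows per day."""
    return records_per_day * record_size_kb / KB_PER_GB


def days_to_reach_tb(target_tb=1.0):
    """Return how many days of loading it takes to reach target_tb terabytes."""
    return target_tb * GB_PER_TB / daily_growth_gb()


print(f"Daily growth: {daily_growth_gb():.1f} GB")
print(f"Days to reach 1 TB: {days_to_reach_tb():,.0f}")
```

At this rate the table passes the terabyte mark after several thousand days of loading, which is why the later slides treat terabyte-scale fact tables as the normal case.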

Fundamentals

How long does it take to access all the fact records?

E.g., a small fact table is 1 terabyte in size!

– If every byte cost one 0.01 s disk access: 0.01 s × 10^12 = 10^10 s, about 317 years. LONG!
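As a rough check of this estimate, the sketch below multiplies one disk access per byte by the table size. It assumes 1 terabyte = 10^12 bytes and a 365-day year; the constant and function names are made up for the example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year
DISK_ACCESS_S = 0.01                # disk access time from the storage hierarchy slide
TABLE_BYTES = 10**12                # assumption: 1 terabyte in decimal units


def byte_at_a_time_years(table_bytes=TABLE_BYTES, access_s=DISK_ACCESS_S):
    """Years needed if every byte required its own disk access."""
    return table_bytes * access_s / SECONDS_PER_YEAR


print(f"{byte_at_a_time_years():.0f} years")
```

The result is on the order of three centuries, which is the point of the slide: paying a full disk access per byte is hopeless.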

Fundamentals: the Storage Hierarchy

[Figure: CPU, cache, memory, and disk connected by a bus; the disk is reached through the disk drive (I/O channel).]

The logical unit of data transferred between disk and memory is the block (e.g., 4 KB).

Fundamentals

How long does it take to access all the fact records?

E.g., a small fact table is 1 terabyte in size!

– Number of blocks: 1 TB / 4 KB = 250 million
– Access time = 0.01 s × 250,000,000 = 2,500,000 s, about 29 days
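Reading by block means paying one disk access per block instead of per byte. The sketch below redoes that arithmetic for two table sizes, assuming decimal units (4 KB = 4,000 bytes) and 0.01 s per block access; the names are illustrative only. The block count depends heavily on the table size assumed: 2.5 million blocks corresponds to roughly 10 GB, while a full terabyte needs about 250 million blocks.

```python
BLOCK_BYTES = 4_000      # assumption: 4 KB block in decimal units
BLOCK_ACCESS_S = 0.01    # one disk access per block


def full_scan(table_bytes, block_bytes=BLOCK_BYTES, access_s=BLOCK_ACCESS_S):
    """Return (number of blocks, seconds) for reading the whole table once."""
    blocks = (table_bytes + block_bytes - 1) // block_bytes  # ceiling division
    return blocks, blocks * access_s


for label, size in [("10 GB", 10**10), ("1 TB", 10**12)]:
    blocks, seconds = full_scan(size)
    print(f"{label}: {blocks:,} blocks, {seconds / 3600:,.1f} hours")
```

Either way the scan drops from centuries to hours or days, but a full scan of a terabyte-scale fact table is still too slow for interactive queries, which motivates aggregates and indexes.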

Aggregate

• In data warehouse design, we choose the grain of the fact table to be the lowest possible level.

[Figure: star schema with the SALES fact table referencing the CUSTOMER, PRODUCT, and TIME dimension tables.]

Grain: order line

Aggregate

• Why choose the lowest level of detail for the fact table?

– (X) Not because analysts want to query single records.

– (O) Because analysts want to flexibly cut and group records.

Aggregate

• However, keeping the most detailed fact records could result in

– a huge fact table: terabytes?! (1 million records/day, 256 bytes/record -> about 0.25 GB/day)

– slow queries

Aggregate

• To keep a data warehouse flexible, fact tables need to store facts at their lowest level of detail.

• To improve query performance, it helps to add another type of fact table that stores pre-computed summaries of the detailed facts.

• The performance problem is thereby reduced to a logical design solution.

Aggregate

• An aggregate fact table is a fact table that summarizes base-level fact table records along one or several dimensions.

• An aggregate dimension table is a dimension table that summarizes base-level dimension table records.

• E.g., marketing managers check daily product sales by city: aggregate by city in the customer dimension.

Aggregate

Aggregate dimension table:
CUST_CITY: # CITY_KEY, * CITY, * STATE

Aggregate fact table:
SALES_BY_CITY: # TIME_KEY, # PRODUCT_KEY, # CITY_KEY, * AVERAGE_PRICE_BY_CITY, * TOTAL_QUANTITY_BY_CITY, * TOTAL_SALES_BY_CITY

Other dimension tables:
PRODUCT: # PRODUCT_KEY, * PID, * PNAME, * PCNAME
TIME: # TIME_KEY, * ORDERDATE, * DAY_NUMBER_IN_MONTH

SALES_BY_CITY references CUST_CITY, PRODUCT, and TIME.

Aggregate

Base-level tables:
SALES: # TIME_KEY, # PRODUCT_KEY, # CUSTOMER_KEY, * PRICE, * QUANTITY, * SALES
CUSTOMER: # CUSTOMER_KEY, * CID, * CNAME, * CITY, * STATE

Aggregate tables:
SALES_BY_CITY: # TIME_KEY, # PRODUCT_KEY, # CITY_KEY, * AVERAGE_PRICE_BY_CITY, * TOTAL_QUANTITY_BY_CITY, * TOTAL_SALES_BY_CITY
CUST_CITY: # CITY_KEY, * CITY, * STATE

Shared dimension tables:
PRODUCT: # PRODUCT_KEY, * PID, * PNAME, * PCNAME
TIME: # TIME_KEY, * ORDERDATE, * DAY_NUMBER_IN_MONTH

SALES references TIME, PRODUCT, and CUSTOMER. SALES_BY_CITY references TIME, PRODUCT, and CUST_CITY.
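One way to populate the aggregate tables from the base-level tables is a GROUP BY over the join of SALES and CUSTOMER. The following is a minimal sketch using Python's built-in sqlite3 module, not the course's own load procedure; the sample rows are invented for illustration, and only the columns needed for the aggregation are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMER (CUSTOMER_KEY INTEGER PRIMARY KEY, CNAME TEXT, CITY TEXT, STATE TEXT);
CREATE TABLE SALES (TIME_KEY INTEGER, PRODUCT_KEY INTEGER, CUSTOMER_KEY INTEGER,
                    PRICE REAL, QUANTITY INTEGER, SALES REAL);

-- Invented sample data: two customers in Logan, one in Provo.
INSERT INTO CUSTOMER VALUES (1, 'Ann', 'Logan', 'UT'), (2, 'Bob', 'Logan', 'UT'),
                            (3, 'Cy', 'Provo', 'UT');
INSERT INTO SALES VALUES (1, 10, 1, 5.0, 2, 10.0), (1, 10, 2, 7.0, 1, 7.0),
                         (1, 10, 3, 6.0, 3, 18.0);

-- Aggregate dimension: one row per distinct (CITY, STATE).
CREATE TABLE CUST_CITY (CITY_KEY INTEGER PRIMARY KEY, CITY TEXT, STATE TEXT);
INSERT INTO CUST_CITY (CITY, STATE)
    SELECT DISTINCT CITY, STATE FROM CUSTOMER ORDER BY CITY, STATE;

-- Aggregate fact: base facts rolled up from customer to city.
CREATE TABLE SALES_BY_CITY AS
    SELECT s.TIME_KEY, s.PRODUCT_KEY, cc.CITY_KEY,
           AVG(s.PRICE)    AS AVERAGE_PRICE_BY_CITY,
           SUM(s.QUANTITY) AS TOTAL_QUANTITY_BY_CITY,
           SUM(s.SALES)    AS TOTAL_SALES_BY_CITY
    FROM SALES s
    JOIN CUSTOMER c   ON c.CUSTOMER_KEY = s.CUSTOMER_KEY
    JOIN CUST_CITY cc ON cc.CITY = c.CITY AND cc.STATE = c.STATE
    GROUP BY s.TIME_KEY, s.PRODUCT_KEY, cc.CITY_KEY;
""")

rows = conn.execute("""
    SELECT cc.CITY, a.AVERAGE_PRICE_BY_CITY, a.TOTAL_QUANTITY_BY_CITY, a.TOTAL_SALES_BY_CITY
    FROM SALES_BY_CITY a JOIN CUST_CITY cc ON cc.CITY_KEY = a.CITY_KEY
    ORDER BY cc.CITY
""").fetchall()
for row in rows:
    print(row)
```

A query for sales by city can then read the much smaller SALES_BY_CITY table instead of scanning SALES, which is the storage-for-speed trade the deck opens with.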

Indexes

[Figure: star schema with the SALES fact table referencing the CUSTOMER, PRODUCT, and TIME dimension tables.]

How long does it take to find the total purchase amount by Tom Jones?

Indexes

• Customer table
– 1M records, each record 0.2 KB long
– Block size is 4 KB; block access time is 0.01 s
– Number of records/block: 4/0.2 = 20
– Number of blocks: 1M/20 = 50K

• Sequential search
– Time: on average half of the blocks are read, 25K × 0.01 s = 250 s, about 4 min
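The sequential-search estimate can be reproduced with a short calculation. This sketch uses the slide's figures with decimal units (0.2 KB = 200 bytes, 4 KB = 4,000 bytes); the names are illustrative, and the average case assumes the record is equally likely to be in any block.

```python
RECORDS = 1_000_000
RECORD_BYTES = 200       # 0.2 KB, decimal units assumed
BLOCK_BYTES = 4_000      # 4 KB, decimal units assumed
BLOCK_ACCESS_S = 0.01

records_per_block = BLOCK_BYTES // RECORD_BYTES
blocks = RECORDS // records_per_block

# On average a sequential search reads half the blocks; in the worst case, all of them.
average_s = blocks / 2 * BLOCK_ACCESS_S
worst_s = blocks * BLOCK_ACCESS_S

print(f"{records_per_block} records/block, {blocks:,} blocks")
print(f"average {average_s:.0f} s, worst case {worst_s:.0f} s")
```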

Indexes

• Binary search (requires the file to be sorted on the search key)
– Time: log2(50K) × 0.01 s ≈ 16 × 0.01 s = 0.16 s

• B+ tree index
– CREATE INDEX pn ON customer (cname);
– If each node (block) in the B+ tree has 117 keys, then
  • # of accesses to the index: log117(1M) ≈ 3 (i.e., the height of the tree)
  • # of accesses to the Customer dimension: 1
  • Total time = 4 × 0.01 s = 0.04 s
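The two estimates above come down to counting block reads. A small sketch of that count follows, using the slide's figures (50K blocks, 1M records, 117 keys per node, 0.01 s per access); the helper name `levels_needed` is made up for the example, and integer loops are used instead of floating-point logarithms to avoid rounding surprises.

```python
BLOCKS = 50_000
RECORDS = 1_000_000
KEYS_PER_NODE = 117
BLOCK_ACCESS_S = 0.01


def levels_needed(items, fanout):
    """Smallest h such that fanout**h >= items, i.e. ceil(log_fanout(items))."""
    height, reach = 0, 1
    while reach < items:
        reach *= fanout
        height += 1
    return height


binary_reads = levels_needed(BLOCKS, 2)            # halving the file each step
tree_height = levels_needed(RECORDS, KEYS_PER_NODE)
btree_reads = tree_height + 1                      # index levels plus the data block

print(f"binary search: {binary_reads} reads, {binary_reads * BLOCK_ACCESS_S:.2f} s")
print(f"B+ tree: height {tree_height}, {btree_reads} reads, {btree_reads * BLOCK_ACCESS_S:.2f} s")
```

The large fan-out is what keeps the tree shallow: each extra level multiplies the reach by 117 rather than by 2.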

B+-trees - P=12

[Figure: a B+-tree with P = 12; each node holds 11 key values and 12 pointers. The upper levels are indexes to indexes, and the lowest level holds indexes to the customer records.]
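The idea of "indexes to indexes" can be illustrated with a static multi-level index over sorted keys. This is a deliberately simplified sketch, not a full B+-tree: there is no insertion, deletion, or node splitting, and each node here holds up to 12 entries rather than 11 keys plus 12 pointers. The function names are made up for the example.

```python
from bisect import bisect_right

P = 12  # fan-out per node, following the slide's P = 12


def build_levels(sorted_keys, fanout=P):
    """Build index levels bottom-up; levels[0] is the lowest level, levels[-1] the root."""
    levels = [sorted_keys]
    while len(levels[-1]) > fanout:
        below = levels[-1]
        # Each upper entry is the smallest key of one group of `fanout` entries below.
        levels.append(below[::fanout])
    return levels


def lookup(levels, key, fanout=P):
    """Descend from the root to the lowest level; return (found, nodes_read)."""
    nodes_read = 0
    pos = 0  # which node to read at the current level
    for depth in range(len(levels) - 1, -1, -1):
        node = levels[depth][pos * fanout:(pos + 1) * fanout]
        nodes_read += 1
        if depth == 0:
            return key in node, nodes_read
        # Follow the last entry whose smallest key is <= the search key.
        child = max(bisect_right(node, key) - 1, 0)
        pos = pos * fanout + child


keys = list(range(0, 2000, 2))  # 1,000 sorted sample keys (invented data)
levels = build_levels(keys)
print(len(levels), lookup(levels, 1998), lookup(levels, 3))
```

With 1,000 keys and a fan-out of 12, every lookup touches three nodes, however the key is placed, which mirrors the fixed tree height used in the timing estimate.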

Indexes

[Figure: star schema with the SALES fact table referencing the CUSTOMER, PRODUCT, and TIME dimension tables.]

How long does it take to find the total sales of Desktop computers?

Performance Improvement

• Suppose there are only 4 product categories for 1M products

• Create a B+ tree index???
– Suppose a product category value plus a block ID takes 10 bytes
– Size of index = 1M × 10 bytes = 10 MB

Performance Improvement

• A bitmap index for an attribute A is a collection of bit vectors, one for each possible value of A. The vector for value v has 1 in position i if the ith record has v for attribute A.

Bitmaps

Product      record 1   record 2   record 3
Desktop      1          0          1
Notebook     0          1          0
Server       0          0          0
Accessory    0          0          0
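A bitmap index like the one in the table can be sketched with one Python integer per attribute value, where bit i stands for record i + 1. This is an illustrative toy, not how a database engine stores bitmaps; the function names are assumptions, and the three sample records follow the table (Desktop, Notebook, Desktop).

```python
def build_bitmap_index(values):
    """Return {value: int}; bit i of the int is 1 if record i (0-based) has that value."""
    index = {}
    for i, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << i)
    return index


def matching_records(bits):
    """Return the 1-based record numbers whose bit is set."""
    return [i + 1 for i in range(bits.bit_length()) if (bits >> i) & 1]


pcname = ["Desktop", "Notebook", "Desktop"]  # PCNAME of records 1..3
index = build_bitmap_index(pcname)

desktop = index.get("Desktop", 0)
notebook = index.get("Notebook", 0)
server = index.get("Server", 0)  # no records, so an all-zero vector

print(matching_records(desktop))             # records with PCNAME = Desktop
print(matching_records(desktop | notebook))  # Desktop OR Notebook
print(matching_records(desktop & notebook))  # Desktop AND Notebook
```

Combining conditions is a single bitwise operation on the vectors, which is why bitmap indexes are efficient for Boolean predicates.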

Performance Improvement

• A bitmap index is suitable for a low-cardinality attribute.
– Cardinality(A) = # of possible values for A / # of records

• Compared with a B+ tree index, a bitmap index has the following advantages for low-cardinality attributes:
– Storage space saving (1M records × 4 bit vectors / 8 bits per byte = 500K bytes)
– Efficient for Boolean operations

• CREATE BITMAP INDEX bitpc ON PRODUCT (PCNAME);
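The storage comparison can be checked with a few lines of arithmetic. This sketch assumes uncompressed bit vectors and the 10-byte B+ tree entry used earlier in the deck; the function names are made up for the example.

```python
RECORDS = 1_000_000
DISTINCT_VALUES = 4        # product categories
BTREE_ENTRY_BYTES = 10     # category value plus block ID


def cardinality(distinct_values, records):
    """Cardinality as defined on the slide: distinct values per record."""
    return distinct_values / records


def bitmap_index_bytes(records, distinct_values):
    """One bit per record per distinct value, uncompressed."""
    return records * distinct_values // 8


def btree_index_bytes(records, entry_bytes):
    """One index entry per record."""
    return records * entry_bytes


bitmap = bitmap_index_bytes(RECORDS, DISTINCT_VALUES)
btree = btree_index_bytes(RECORDS, BTREE_ENTRY_BYTES)
print(f"cardinality: {cardinality(DISTINCT_VALUES, RECORDS):.6f}")
print(f"bitmap: {bitmap:,} bytes, B+ tree: {btree:,} bytes, ratio {btree // bitmap}x")
```

The saving shrinks as the number of distinct values grows, since each new value adds a full bit vector; that is why the bitmap index is recommended only for low-cardinality attributes.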