on-line analytical processing salman azhar

42
2/10/05 Salman Azhar: Database Sy stems 1 On-Line Analytical Processing Salman Azhar Warehousing Data Cubes Data Mining These slides use some figures, definitions, and explanations from Elmasri-Navathe’s Fundamentals of Database Systems and Molina-Ullman-Widom’s Database Systems

Upload: rana-vance

Post on 05-Jan-2016

42 views

Category:

Documents


3 download

DESCRIPTION

On-Line Analytical Processing Salman Azhar. Warehousing Data Cubes Data Mining. These slides use some figures, definitions, and explanations from Elmasri-Navathe’s Fundamentals of Database Systems and Molina-Ullman-Widom’s Database Systems. Overview. Traditional database systems - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

1

On-Line Analytical Processing

Salman Azhar

WarehousingData CubesData Mining

These slides use some figures, definitions, and explanations from Elmasri-Navathe’s Fundamentals of Database Systems

and Molina-Ullman-Widom’s Database Systems

Page 2: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

2

Overview

Traditional database systems tuned to many, small, simple queries

Some newer “analytic” applications fewer, more time-consuming, complex

queries New architectures

developed to handle complex “analytic” queries efficiently

Page 3: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

3

The Data Warehouse

The most common form of data integration: Copy sources into a single DB

(warehouse) and try to keep it up-to-date

Usual method: periodic reconstruction of the warehouse, perhaps overnight

Warehouse essential for analytic queries

Page 4: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

4

OLTP Most database operations involve

On-Line Transaction Processing (OTLP). Short, simple, frequent queries and/or

modifications Each involving a small number of tuples.

Examples… : Looking up a phone number on the web Sales at cash registers Selling airline tickets

Page 5: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

5

OLAP

Increasing importance of On-Line Application Processing (OLAP) queries Few, but complex queries --- may run

for hours. Queries do not depend on having an

absolutely up-to-date database. Sometimes called Data Mining

Page 6: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

6

OLAP Examples

1. Amazon analyzes customer purchases by its customers to recommend with products of likely interest

Compares purchases between customers Takes longer than customers are willing to

wait

2. Wal-Mart looks for items with sales trends in a region or time period

Presents data to vendors Used to determine ordering and inventory

Page 7: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

7

Common Architecture

Databases at branches handle OLTP

Local databases copied to a central warehouse overnight (or periodically)

Analysts use the warehouse for OLAP

OLTP

OLTP

OLTP

OLTP

OLAP

AnalystsTransaction Users

Page 8: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

8

Data Warehouse

Data Warehouse

Data Access

User Data AccessData

Sources

Data Input

StagingArea

Data Marts

Page 9: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

9

Star Schemas A star schema

common organization for data at a warehouse

It consists of… Fact table :

a very large accumulation of facts such as sales often “insert-only”

Dimension tables : smaller, generally static information about the entities involved in the facts

Page 10: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

10

StarSchema

Fact TableDimension Table

Employee_DimEmployee_DimEmployee_DimEmployee_DimEmployeeKeyEmployeeKey

EmployeeID...EmployeeID...

Time_DimTime_DimTime_DimTime_DimTimeKeyTimeKey

TheDate...TheDate...

Product_DimProduct_DimProduct_DimProduct_DimProductKeyProductKey

ProductID...ProductID...

Customer_DimCustomer_DimCustomer_DimCustomer_DimCustomerKeyCustomerKey

CustomerID...CustomerID...

Shipper_DimShipper_DimShipper_DimShipper_DimShipperKeyShipperKey

ShipperID...ShipperID...

Sales_FactSales_FactTimeKeyEmployeeKeyProductKeyCustomerKeyShipperKey

TimeKeyEmployeeKeyProductKeyCustomerKeyShipperKey

Sales AmountUnit Sales ...Sales AmountUnit Sales ...

Page 11: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

11

Example: Star Schema

Suppose we want to record in a warehouse information about every car sale: dealer, car, buyer, day, time, price

paid The fact table is a relation:

Sale(dealer, model, buyer, day, time, price)

Page 12: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

12

Example, Continued The dimension tables include

information about the dealer, car, and buyer “dimensions”: Dealer(dealer, city, zip) Car(model, manufacturer) Buyer(buyer, city, phone)

Recall the fact table: Sale(dealer, model, buyer, day, time,

price)

Page 13: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

13

Dimensions and Dependent Attributes Two classes of fact-table attributes:

Dimension attributes : the key of a dimension table Sale(dealer, model, buyer, day, time, price)

Dependent attributes : a value determined by the dimension

attributes of the row Sale(dealer, model, buyer, day, time, price) E.g., price determined by the combination

of dealer, model, buyer, day, time

Page 14: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

14

Example: Dependent Attribute

price is determined by the combination of dimension

attributes: dealer, car, buyer, and the time

(combination of day and time attributes).

Page 15: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

15

Approaches to Building Warehouses

ROLAP = “relational OLAP”: Tune a relational DBMS to support

star schemas MOLAP = “multidimensional

OLAP”: Use a specialized DBMS with a

model such as the “data cube”

Page 16: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

16

ROLAP Techniques Bitmap indexes :

For each key value of a dimension table (e.g., each model for relation Cars)

create a bit-vector telling which tuples of the fact table have that value

Materialized views : Store the answers to several useful

queries (views) in the warehouse itself Stored views!

Page 17: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

17

Typical OLAP Queries Often, OLAP queries begin with a “star join”:

the natural join of the fact table with all or most of the dimension tables

Recall the tables:Sales(dealer, model, buyer, day, time, price)

Dealers(dealer, city, zip) Cars(model, manufacturer) Buyers(buyer, city, phone)

Example:SELECT * FROM Sales, Dealers, Cars, BuyersWHERE Sales.dealer = Dealers.dealer AND

Sales.model = Cars.model ANDSales.buyer = Buyers.buyer;

Page 18: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

18

Typical OLAP Queries --- 2

The typical OLAP query will:1. Start with a star join2. Select for interesting tuples, based

on dimension data3. Group by one or more dimensions4. Aggregate certain attributes of the

result

Page 19: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

19

Example: OLAP Query For each dealer in Indianapolis

find the total sales of each car manufactured by BMW

Filter: city = “Indianapolis” manf = “BMW”

Grouping: by dealer and car

Aggregation: Sum of price

GROUP EXERCISE:Write the SQL Query

Note: Do not turn over to the next page before

attempting this exercise yourself!

Page 20: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

20

Example: In SQL

SELECT dealer, model, SUM(price)

FROM Sales NATURAL JOIN Dealers

NATURAL JOIN Cars

WHERE Dealer.city = ’Indianapolis’

AND Car.manf = ’BMW’

GROUP BY dealer, model;

Page 21: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

21

Using Materialized Views

A direct execution of this query from Sales and the dimension tables could take too long

If we create a materialized view that contains enough information, we may be able to answer our query

much faster

Page 22: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

22

Example: Materialized View

Which views could help with our query? Key issues:

1. It must join Sales, Dealers, and Cars, at least2. It must group by at least dealer and car3. It must not select out Indianapolis Dealers or

BMW Cars4. It must not project out city or manf

Page 23: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

23

Example --- Continued Here is a materialized view that could help:

CREATE VIEW vSales(dealer, city,

car, manf, sales) AS

SELECT dealer, city, model, manf,

SUM(price) sales

FROM Sales NATURAL JOIN Dealers

NATURAL JOIN Cars

GROUP BY dealer, city, model, manf;

Since dealer -> city and model -> manf, there is no real grouping.We need city and manf in the SELECT.

Page 24: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

24

Example --- Concluded

Here’s our query using the materialized view vSales:

SELECT dealer, car, sales

FROM vSales

WHERE city = ’Indianapolis’

AND manf = ’BMW’;

Page 25: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

25

MOLAP and Data Cubes

Keys of dimension tables are the dimensions of a hypercube Example: for the Sales data, the four

dimensions are Dealers, Cars, Buyers, and time

Dependent attributes (e.g., price) appear at the points of the cube

Page 26: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

26

Defining a Cube

Q4Q1 Q2 Q3Time Dimension Pro

ducts

Dim

ensio

n

Detroit

Denver

Chicago

Mar

ket D

imen

sion

Apples

CherriesGrapes

Atlanta

Melons

Page 27: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

27

Q4Q1 Q2 Q3Time Dimension

Produ

cts D

imen

sionDallas

Denver

Chicago

Mar

kets

Dim

ensi

on

Apples

CherriesGrapes

AtlantaSales Fact

Melons

Querying a Cube

Page 28: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

28

ApplesQ4Q1 Q2 Q3

Time Dimension

Produ

cts D

imen

sion

Detroit

Denver

Chicago

Atlanta

Mar

kets

Dim

ensi

on

MelonsCherries

Grapes

Defining a Cube Slice

Page 29: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

29

Working with Dimensions and Hierarchies Dimensions Allow You to

Slice Dice

Hierarchies Allow You to

Drill Down Drill Up

Page 30: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

30

Marginals

The data cube also includes aggregation (typically SUM) along the margins of the cube

The marginals include aggregations over one dimension, two

dimensions,…

Page 31: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

31

Example: Marginals

Our 4-dimensional Sales cube includes the sum of price over each dealer, each

car, each buyer, and each time unit (perhaps days)

It would also have the sum of price over all dealer-model pairs, all dealer-buyer-

day triples,…

Page 32: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

32

Structure of the Cube Think of each dimension as having an

additional value * A point with one or more *’s in its

coordinates aggregates over the dimensions with the *’s.

Example: Sales(“Auto Nation”, “Mini Cooper”, *, *)

holds the sum over all Buyers and all time of the Mini Coopers bought at AutoNation

Page 33: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

33

Drill-Down

Drill-down = “de-aggregate” = break an aggregate into its

constituents Example:

having determined that Auto Nation sells very few BMW Cars,

break down his sales by particular car

Page 34: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

34

Roll-Up

Roll-up = aggregate along one or more

dimensions. Example:

given a table of how many Mini Coopers each buyer buys at each dealer,

roll it up into a table giving total number of Mini Coopers bought by each buyer

Page 35: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

35

Materialized Data-Cube Views

Data cubes invite materialized views that are aggregations in one or more dimensions

Dimensions may not be completely aggregated an option is to group by an attribute

of the dimension table

Page 36: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

36

Example

A materialized view for our Sales data cube might:

1. Aggregate by buyer completely2. Not aggregate at all by car3. Aggregate by time according to the

week4. Aggregate according to the city of

the dealer

Page 37: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

37

Data Mining

Data mining is a popular term for queries that summarize big data sets in useful ways

Examples:1. Clustering all Web pages by topic2. Finding characteristics of fraudulent

credit-card use

Page 38: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

38

Market-Basket Data

An important form of mining from relational data involves market baskets sets of “items” that are purchased

together as a customer leaves a store Summary of basket data is frequent

itemsets sets of items that often appear together

in baskets

Page 39: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

39

Example: Market Baskets

If people often buy bread and butter together, the store can:

1. Put bread and butter near each other and put potato chips between the two

2. Run a sale on bread and raise the price of butter

Page 40: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

40

Finding Frequent Pairs The simplest case is when we only

want to find “frequent pairs” of items. Assume data is in a relation

Baskets(basket, item) The support thresholds is the

minimum number of baskets in which a pair appears before we are interested

Page 41: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

41

Frequent Pairs in SQL

SELECT b1.item, b2.item

FROM Baskets b1, Baskets b2

WHERE b1.basket = b2.basket

AND b1.item < b2.item

GROUP BY b1.item, b2.item

HAVING COUNT(*) >= s;

Look for twoBasket tupleswith the same

basket anddifferent items.First item must

precede second,so we don’t

count the samepair twice.

Create a group foreach pair of itemsthat appears in atleast one basket.

Throw away pairs of itemsthat do not appear at least

s times.

Page 42: On-Line Analytical Processing  Salman Azhar

2/10/05 Salman Azhar: Database Systems

42

Summary

OLAP vs. OLTP Two different worlds

Warehousing Data Cubes Data Mining Materialized views Storing aggregate data