On-Line Application Processing: Warehousing, Data Cubes, (Data Mining) (slides borrowed from Stanford)

Upload: caesar-conley

Post on 04-Jan-2016



TRANSCRIPT

Page 1: On-Line Application Processing

On-Line Application Processing

Warehousing
Data Cubes
(Data Mining)

(slides borrowed from Stanford)

Page 2: On-Line Application Processing

Overview

Traditional database systems are tuned to many, small, simple queries.

Some new applications use fewer, more time-consuming, complex queries.

New architectures have been developed to handle complex “analytic” queries efficiently.

Page 3: On-Line Application Processing

The Data Warehouse

The most common form of data integration.
    Copy sources into a single DB (warehouse) and try to keep it up-to-date.
    Usual method: periodic reconstruction of the warehouse, perhaps overnight.

Frequently essential for analytic queries.

Page 4: On-Line Application Processing

OLTP

Most database operations involve On-Line Transaction Processing (OLTP).
    Short, simple, frequent queries and/or modifications, each involving a small number of tuples.
    Examples: answering queries from a Web interface, sales at cash registers, selling airline tickets.

Page 5: On-Line Application Processing

OLAP

Of increasing importance are On-Line Analytical Processing (OLAP) queries.
    Few, but complex queries; they may run for hours.
    Queries do not depend on having an absolutely up-to-date database.

Page 6: On-Line Application Processing

OLAP Examples

1. Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer.

2. Analysts at Wal-Mart look for items with increasing sales in some region.

Page 7: On-Line Application Processing

Common Architecture

Databases at store branches handle OLTP.

Local store databases are copied to a central warehouse overnight.

Analysts use the warehouse for OLAP.

Page 8: On-Line Application Processing

Loading the Data Warehouse

[Figure: Source Systems (OLTP) → Data Staging Area → Data Warehouse]

Data is periodically extracted.

Data is cleansed and transformed.

Users query the data warehouse.

Page 9: On-Line Application Processing

Terminology: ETL

ETL = Extraction, Transformation, & Load
    Extraction: Get the data out of the source systems
    Transformation: Convert the data into a useful format for analysis
    Load: Get the data into the data warehouse (…and build indexes, materialized views, etc.)
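The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source formats, the sample rows, the function names, and the `sales` table layout are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical raw records from two source systems with different formats.
SOURCE_A = [("Joe's Bar", "Bud", "2.50")]                      # price as text
SOURCE_B = [{"bar": "Sue's Bar", "beer": "Miller", "price_cents": 300}]

def extract():
    """Extraction: get the data out of the source systems."""
    return list(SOURCE_A), list(SOURCE_B)

def transform(rows_a, rows_b):
    """Transformation: convert both formats to (bar, beer, price in dollars)."""
    unified = [(bar, beer, float(price)) for bar, beer, price in rows_a]
    unified += [(r["bar"], r["beer"], r["price_cents"] / 100) for r in rows_b]
    return unified

def load(rows):
    """Load: insert into the warehouse table and build an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (bar TEXT, beer TEXT, price REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX sales_by_bar ON sales (bar)")
    conn.commit()
    return conn

warehouse = load(transform(*extract()))
loaded = warehouse.execute(
    "SELECT bar, beer, price FROM sales ORDER BY bar"
).fetchall()
print(loaded)
```

In a real warehouse the load step would typically run on a schedule (for example overnight) rather than once.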

Page 10: On-Line Application Processing

Data Integration is Hard

Data warehouses combine data from multiple sources.
    Data must be translated into a consistent format.

Data integration represents ~80% of effort for a typical data warehouse project!

Some reasons why it’s hard:
    Metadata is often poor or non-existent
    Data quality is often bad
        Missing or default values
        Multiple spellings of the same thing (Cal vs. UC Berkeley vs. University of California)
    Inconsistent semantics
        What is an airline passenger?

Page 11: On-Line Application Processing

Federated Databases

An alternative to data warehouses.

Data warehouse
    Create a copy of all the data
    Execute queries against the copy

Federated database
    Pull data from source systems as needed to answer queries

“Lazy” vs. “eager” data integration

[Figure, Data Warehouse: Source Systems → Extraction → Warehouse; Query in, Answer out]
[Figure, Federated Database: Query → Mediator → Rewritten Queries → Source Systems; Answer out]

Page 12: On-Line Application Processing

Star Schemas

A star schema is a common organization for data at a warehouse. It consists of:

1. Fact table: a very large accumulation of facts such as sales.
    Often “insert-only.”

2. Dimension tables: smaller, generally static information about the entities involved in the facts.

Page 13: On-Line Application Processing

Example: Star Schema

Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged.

The fact table is a relation:

Sales(bar, beer, drinker, day, time, price)

Page 14: On-Line Application Processing

Example, Continued

The dimension tables include information about the bar, beer, and drinker “dimensions”:

Bars(bar, addr, license)

Beers(beer, manf)

Drinkers(drinker, addr, phone)
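As a rough sketch, the fact and dimension relations of this example can be declared in SQLite and joined for a typical analytic query. The column types, the sample rows, and the revenue-by-manufacturer query are assumptions added for illustration; only the relation and attribute names come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: small, descriptive, keyed by the dimension attribute.
    CREATE TABLE Bars     (bar TEXT PRIMARY KEY, addr TEXT, license TEXT);
    CREATE TABLE Beers    (beer TEXT PRIMARY KEY, manf TEXT);
    CREATE TABLE Drinkers (drinker TEXT PRIMARY KEY, addr TEXT, phone TEXT);
    -- Fact table: one row per sale, referencing each dimension.
    CREATE TABLE Sales (
        bar     TEXT REFERENCES Bars(bar),
        beer    TEXT REFERENCES Beers(beer),
        drinker TEXT REFERENCES Drinkers(drinker),
        day TEXT, time TEXT, price REAL
    );
""")
# Hypothetical sample data.
conn.executemany("INSERT INTO Beers VALUES (?, ?)",
                 [("Bud", "Anheuser-Busch"), ("Miller", "Miller Brewing")])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?)", [
    ("Joe's Bar", "Bud",    "Ann", "2016-01-04", "18:00", 2.50),
    ("Joe's Bar", "Miller", "Bob", "2016-01-04", "19:30", 3.00),
    ("Sue's Bar", "Bud",    "Ann", "2016-01-05", "20:15", 2.75),
])

# Join the fact table to a dimension table and aggregate the dependent attribute.
revenue_by_manf = conn.execute("""
    SELECT Beers.manf, SUM(Sales.price)
    FROM Sales JOIN Beers ON Sales.beer = Beers.beer
    GROUP BY Beers.manf
    ORDER BY Beers.manf
""").fetchall()
print(revenue_by_manf)
```

The join goes from the large fact table out to one small dimension table, which is the access pattern the star shape is meant to serve.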

Page 15: On-Line Application Processing

Visualization - Star Schema

[Figure: the fact table (Sales) in the center, holding dimension attributes and dependent attributes, linked to the surrounding dimension tables (Beers, Bars, Drinkers, etc.)]

Page 16: On-Line Application Processing

Dimensions and Dependent Attributes

Two classes of fact-table attributes:

1. Dimension attributes: the key of a dimension table.

2. Dependent attributes: a value determined by the dimension attributes of the tuple.

Page 17: On-Line Application Processing

Example: Dependent Attribute

price is the dependent attribute of our example Sales relation.

It is determined by the combination of dimension attributes: bar, beer, drinker, and the time (combination of day and time-of-day attributes).

Page 18: On-Line Application Processing

Comparing Facts and Dimensions

Facts                  Dimensions
Narrow                 Wide
Big (many rows)        Small (few rows)
Numeric                Descriptive
Growing over time      Static

Facts contain numbers, dimensions contain labels.

Page 19: On-Line Application Processing

Cross Tabulation of sales by item-name and color

[Table not reproduced in this transcript: sales by item-name and color]

The table above is an example of a cross-tabulation (cross-tab), also referred to as a pivot-table.

A cross-tab is a table where
    values for one of the dimension attributes form the row headers,
    values for another dimension attribute form the column headers.
    Values in individual cells are (aggregates of) the values of the dimension attributes that specify the cell.
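Since the slide's table did not survive, here is a small sketch of how a cross-tab of this shape could be built. The sample tuples, the extra `size` attribute, and the helper name `cross_tab` are hypothetical; the point is only that each cell aggregates the facts that share a row header and a column header.

```python
from collections import defaultdict

# Hypothetical (item_name, color, size, quantity) sales tuples.
sales = [
    ("skirt", "dark",   "small",  2),
    ("skirt", "dark",   "large",  6),
    ("skirt", "pastel", "small", 35),
    ("dress", "dark",   "medium", 20),
    ("dress", "white",  "small",  5),
]

def cross_tab(rows):
    """Sum quantity per (item_name, color), aggregating over size."""
    cells = defaultdict(int)
    for item, color, _size, qty in rows:
        cells[(item, color)] += qty
    return dict(cells)

cells = cross_tab(sales)
items = sorted({item for item, _ in cells})      # row headers
colors = sorted({color for _, color in cells})   # column headers

# Render one row per item-name and one column per color; absent cells show 0.
print("item".ljust(8) + "".join(c.rjust(8) for c in colors))
for item in items:
    print(item.ljust(8)
          + "".join(str(cells.get((item, c), 0)).rjust(8) for c in colors))
```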

Page 20: On-Line Application Processing

Marginals

The data cube also includes aggregation (typically SUM) along the margins of the cube.

The marginals include aggregations over one dimension, two dimensions, …

Page 21: On-Line Application Processing

Visualization - Data Cube w/ Aggregation

[Figure: a cube with axes bar, beer, and drinker whose cells hold price; one face of the cube holds the SUM over all drinkers]

Page 22: On-Line Application Processing

Example: Marginals

Our 4-dimensional Sales cube includes the sum of price over each bar, each beer, each drinker, and each time unit (perhaps days).

It would also have the sum of price over all bar-beer pairs, all bar-drinker-day triples, …

Page 23: On-Line Application Processing

Structure of the Cube

Think of each dimension as having an additional value *.

A point with one or more *’s in its coordinates aggregates over the dimensions with the *’s.

Example: Sales(“Joe’s Bar”, “Bud”, *, *) holds the sum over all drinkers and all time of the Bud consumed at Joe’s.
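The "*" convention can be made concrete with a short sketch that materializes every point of a small cube. The fact tuples and the helper name `build_cube` are hypothetical, and SUM(price) is assumed as the aggregate; each fact contributes to every point obtained by replacing any subset of its coordinates with "*".

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact tuples: (bar, beer, drinker, day, price).
facts = [
    ("Joe's Bar", "Bud",    "Ann", "Mon", 2.5),
    ("Joe's Bar", "Bud",    "Bob", "Tue", 2.5),
    ("Joe's Bar", "Miller", "Ann", "Mon", 3.0),
    ("Sue's Bar", "Bud",    "Ann", "Mon", 2.75),
]

def build_cube(rows, n_dims=4):
    """Return {coordinates: SUM(price)}, where any coordinate may be '*'."""
    cube = defaultdict(float)
    for *dims, price in rows:
        # Keep or star each dimension: 2**n_dims points per fact.
        for mask in product((False, True), repeat=n_dims):
            point = tuple("*" if starred else value
                          for value, starred in zip(dims, mask))
            cube[point] += price
    return dict(cube)

cube = build_cube(facts)
print(cube[("Joe's Bar", "Bud", "*", "*")])   # all drinkers, all days
print(cube[("*", "*", "*", "*")])             # grand total
```

Materializing every point is only practical for small cubes; the number of points per fact doubles with each added dimension.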

Page 24: On-Line Application Processing

Relational Representation

Crosstabs can be represented as relations.

The value all is used to represent aggregates.

The SQL:1999 standard actually uses null values in place of all.

Page 25: On-Line Application Processing

Three-Dimensional Data Cube

A data cube is a multidimensional generalization of a crosstab.

We cannot view a three-dimensional object in its entirety, but crosstabs can be used as views on a data cube.

Page 26: On-Line Application Processing

Data Cube

Axes of the cube represent attributes of the data records
    e.g. color, month, state
    Called dimensions

Cells hold aggregated measurements
    e.g. total $ sales, number of autos sold
    Called facts

Real data cubes have >> 3 dimensions

[Figure: Auto Sales cube with axes month (Jul, Aug, Sep), state (CA, OR, WA), and color (Red, Blue, Gray)]

Page 27: On-Line Application Processing

Slicing and Dicing

[Figure: the Auto Sales cube (Jul, Aug, Sep by CA, OR, WA by Red, Blue, Gray); a slice of the cube for Blue, by month and state; and monthly totals (Jul, Aug, Sep, Total)]

Page 28: On-Line Application Processing

Querying the Data Cube

Cross-tabulation
    “Cross-tab” for short
    Report data grouped by 2 dimensions
    Aggregate across other dimensions
    Include subtotals

Operations on a cross-tab
    Roll up (further aggregation)
    Drill down (less aggregation)

Number of Autos Sold

         CA    OR    WA   Total
Jul      45    33    30     108
Aug      50    36    42     128
Sep      38    31    40     109
Total   133   100   112     345

Page 29: On-Line Application Processing

Roll Up and Drill Down

Number of Autos Sold

         CA    OR    WA   Total
Jul      45    33    30     108
Aug      50    36    42     128
Sep      38    31    40     109
Total   133   100   112     345

Roll up by Month

Number of Autos Sold

         CA    OR    WA   Total
        133   100   112     345

Drill down by Color

Number of Autos Sold

         CA    OR    WA   Total
Red      40    29    40     109
Blue     45    31    37     113
Gray     48    40    35     123
Total   133   100   112     345
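The roll-up step can be checked with a few lines of Python using the numbers from the tables above. The helper name `roll_up` and the dictionary layout are assumptions made for this sketch. Note that drilling down cannot be computed from the rolled-up totals; it needs the finer-grained cells (here, the by-color table) as input.

```python
# Cross-tab from the slides: autos sold, keyed by (month, state).
by_month_state = {
    ("Jul", "CA"): 45, ("Jul", "OR"): 33, ("Jul", "WA"): 30,
    ("Aug", "CA"): 50, ("Aug", "OR"): 36, ("Aug", "WA"): 42,
    ("Sep", "CA"): 38, ("Sep", "OR"): 31, ("Sep", "WA"): 40,
}
# Drill-down view from the slides, keyed by (color, state).
by_color_state = {
    ("Red", "CA"): 40,  ("Red", "OR"): 29,  ("Red", "WA"): 40,
    ("Blue", "CA"): 45, ("Blue", "OR"): 31, ("Blue", "WA"): 37,
    ("Gray", "CA"): 48, ("Gray", "OR"): 40, ("Gray", "WA"): 35,
}

def roll_up(cells, keep):
    """Sum away every key position except `keep` (hypothetical helper)."""
    totals = {}
    for key, qty in cells.items():
        totals[key[keep]] = totals.get(key[keep], 0) + qty
    return totals

by_state = roll_up(by_month_state, keep=1)   # roll up by month
by_month = roll_up(by_month_state, keep=0)   # roll up by state
print(by_state)
print(by_month)
# Rolling up the by-color view gives the same per-state totals.
print(roll_up(by_color_state, keep=1) == by_state)
```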

Page 30: On-Line Application Processing

Full Data Cube with Subtotals

Pre-computation of aggregates → fast answers to OLAP queries

Ideally, pre-compute all 2^n types of subtotals (for n dimensions)

Otherwise, perform aggregation as needed
    Coarser-grained totals can be computed from finer-grained totals
    But not the other way around
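A short sketch of both claims, assuming the three dimensions state, month, and color used elsewhere in the slides. The CA figures are taken from the auto-sales table; `other_split` is a made-up counterexample showing why a coarse total does not determine the finer cells.

```python
from itertools import combinations

dimensions = ("state", "month", "color")

# Every subset of the dimensions is one type of subtotal: 2**n in all.
groupings = [subset
             for k in range(len(dimensions), -1, -1)
             for subset in combinations(dimensions, k)]
print(len(groupings))   # 2**3
for g in groupings:
    print(g if g else "(grand total)")

# Coarser from finer: the CA total follows from the (state, month) totals.
state_month = {("CA", "Jul"): 45, ("CA", "Aug"): 50, ("CA", "Sep"): 38}
ca_total = sum(state_month.values())

# Not the other way around: a different (made-up) split has the same total,
# so the total alone cannot tell the two apart.
other_split = {("CA", "Jul"): 50, ("CA", "Aug"): 45, ("CA", "Sep"): 38}
print(ca_total, sum(other_split.values()))
```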

Page 31: On-Line Application Processing

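Data Cube Lattice

[Figure: lattice of groupings, from coarsest (top) to finest (bottom)]

Total
State | Month | Color
State, Month | State, Color | Month, Color
State, Month, Color

Drill down moves toward (State, Month, Color); roll up moves toward Total.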

Page 32: On-Line Application Processing

MOLAP vs. ROLAP

MOLAP = Multidimensional OLAP
    Store data cube as multidimensional array
    (Usually) pre-compute all aggregates

Advantages:
    Very efficient data access → fast answers

Disadvantages:
    Doesn’t scale to large numbers of dimensions
    Requires special-purpose data store

Page 33: On-Line Application Processing

Sparsity

Imagine a data warehouse for Safeway.

Suppose dimensions are: Customer, Product, Store, Day

If there are 100,000 customers, 10,000 products, 1,000 stores, and 1,000 days…

…data cube has 1,000,000,000,000,000 cells!

Fortunately, most cells are empty.
    A given store doesn’t sell every product on every day.
    A given customer has never visited most of the stores.
    A given customer has never purchased most products.

Multi-dimensional arrays are not an efficient way to store sparse data.
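The cell count above is just the product of the dimension sizes. The sketch below checks the arithmetic and contrasts a dense array with one possible sparse layout, a dictionary keyed by coordinates. The dictionary layout, the helper `record_sale`, and the sample sales are assumptions for illustration, not something the slides prescribe.

```python
customers, products, stores, days = 100_000, 10_000, 1_000, 1_000

# A dense array needs one cell per coordinate combination.
dense_cells = customers * products * stores * days
print(f"{dense_cells:,}")

# A sparse layout stores only the non-empty cells.
sparse_cube = {}

def record_sale(customer, product, store, day, quantity):
    """Add `quantity` to the cell at these coordinates (hypothetical helper)."""
    key = (customer, product, store, day)
    sparse_cube[key] = sparse_cube.get(key, 0) + quantity

record_sale(17, 4242, 3, 120, 2)
record_sale(17, 4242, 3, 120, 1)     # same cell, accumulates
record_sale(98_001, 7, 512, 999, 1)

print(len(sparse_cube), "cells stored out of", dense_cells)
```

Storage in the sparse layout grows with the number of distinct sales coordinates rather than with the product of the dimension sizes.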

Page 34: On-Line Application Processing

MOLAP vs. ROLAP

ROLAP = Relational OLAP
    Store data cube in relational database
    Express queries in SQL

Advantages:
    Scales well to high dimensionality
    Scales well to large data sets
    Sparsity is not a problem
    Uses well-known, mature technology

Disadvantages:
    Query performance is slower than MOLAP
    Need to construct explicit indexes

Page 35: On-Line Application Processing

Creating a Cross-tab with SQL

    SELECT state, month, SUM(quantity)
    FROM sales
    WHERE color = 'Red'
    GROUP BY state, month

Grouping attributes: state, month

Measurements: SUM(quantity)

Filters: color = 'Red'
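The query can be tried against an in-memory SQLite database. The `sales` table layout and the sample rows are assumptions for this sketch, and an ORDER BY is added only to make the output deterministic. The WHERE clause filters rows before GROUP BY forms the groups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (state TEXT, month TEXT, color TEXT, quantity INTEGER)"
)
# Hypothetical sample rows.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("CA", "Jul", "Red", 20),
    ("CA", "Jul", "Red", 25),    # same group as the row above
    ("CA", "Aug", "Red", 50),
    ("OR", "Jul", "Red", 33),
    ("OR", "Jul", "Blue", 99),   # removed by the filter
])

rows = conn.execute("""
    SELECT state, month, SUM(quantity)   -- measurement
    FROM sales
    WHERE color = 'Red'                  -- filter
    GROUP BY state, month                -- grouping attributes
    ORDER BY state, month
""").fetchall()
print(rows)
```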

Page 36: On-Line Application Processing

What about the totals?

SQL aggregation query with GROUP BY does not produce subtotals, totals

Our cross-tab report is incomplete.

Number of Autos Sold

         CA    OR    WA   Total
Jul      45    33    30       ?
Aug      50    36    42       ?
Sep      38    31    40       ?
Total     ?     ?     ?       ?

State   Month   SUM
CA      Jul     45
CA      Aug     50
CA      Sep     38
OR      Jul     33
OR      Aug     36
OR      Sep     31
WA      Jul     30
WA      Aug     42
WA      Sep     40

Page 37: On-Line Application Processing

One solution: a big UNION ALL

    -- Original query
    SELECT state, month, SUM(quantity)
    FROM sales
    WHERE color = 'Red'
    GROUP BY state, month
    UNION ALL
    -- State subtotals
    SELECT state, 'ALL', SUM(quantity)
    FROM sales
    WHERE color = 'Red'
    GROUP BY state
    UNION ALL
    -- Month subtotals
    SELECT 'ALL', month, SUM(quantity)
    FROM sales
    WHERE color = 'Red'
    GROUP BY month
    UNION ALL
    -- Overall total
    SELECT 'ALL', 'ALL', SUM(quantity)
    FROM sales
    WHERE color = 'Red'
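The four-part union runs unchanged in SQLite, which makes it easy to confirm that it fills in every "?" of the incomplete cross-tab. The table layout is an assumption, and the rows are loaded so that the Red quantities match the numbers shown in the slides.

```python
import sqlite3

# Quantities from the slides' cross-tab, per month, in state order.
MONTHLY = {"Jul": (45, 33, 30), "Aug": (50, 36, 42), "Sep": (38, 31, 40)}
STATES = ("CA", "OR", "WA")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (state TEXT, month TEXT, color TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, 'Red', ?)",
    [(s, m, q) for m, qs in MONTHLY.items() for s, q in zip(STATES, qs)],
)

result = conn.execute("""
    SELECT state, month, SUM(quantity) FROM sales
    WHERE color = 'Red' GROUP BY state, month
    UNION ALL
    SELECT state, 'ALL', SUM(quantity) FROM sales
    WHERE color = 'Red' GROUP BY state
    UNION ALL
    SELECT 'ALL', month, SUM(quantity) FROM sales
    WHERE color = 'Red' GROUP BY month
    UNION ALL
    SELECT 'ALL', 'ALL', SUM(quantity) FROM sales
    WHERE color = 'Red'
""").fetchall()

totals = {(state, month): qty for state, month, qty in result}
print(len(result))                 # 9 cells + 3 + 3 subtotals + 1 total
print(totals[("CA", "ALL")], totals[("ALL", "Jul")], totals[("ALL", "ALL")])
```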

Page 38: On-Line Application Processing

A better solution

“UNION ALL” solution gets cumbersome with more than 2 grouping attributes
    n grouping attributes → 2^n parts in the union

OLAP extensions added to SQL:1999 are more convenient
    CUBE, ROLLUP

    SELECT state, month, SUM(quantity)
    FROM sales
    WHERE color = 'Red'
    GROUP BY CUBE(state, month)

Page 39: On-Line Application Processing

Results of the CUBE query

State   Month   SUM(quantity)
CA      Jul     45
CA      Aug     50
CA      Sep     38
CA      NULL    133
OR      Jul     33
OR      Aug     36
OR      Sep     31
OR      NULL    100
WA      Jul     30
WA      Aug     42
WA      Sep     40
WA      NULL    112
NULL    Jul     108
NULL    Aug     128
NULL    Sep     109
NULL    NULL    345

Notice the use of NULL for totals.

Subtotals at all levels.

Page 40: On-Line Application Processing

ROLLUP vs. CUBE

CUBE computes entire lattice

ROLLUP computes one path through lattice
    Order of GROUP BY list matters
    Groups by all prefixes of the GROUP BY list

GROUP BY ROLLUP(A,B,C)
    • (A,B,C)
    • (A,B) subtotals
    • (A) subtotals
    • Total

GROUP BY CUBE(A,B,C)
    • (A,B,C)
    • Subtotals for the following: (A,B), (A,C), (B,C), (A), (B), (C)
    • Total
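The difference between the two operators comes down to which grouping sets they generate: prefixes for ROLLUP, all subsets for CUBE. The helper names below are hypothetical; the sketch only enumerates the grouping sets and does not run any SQL.

```python
from itertools import combinations

def rollup_sets(cols):
    """Grouping sets of GROUP BY ROLLUP(cols): every prefix of the list."""
    return [tuple(cols[:k]) for k in range(len(cols), -1, -1)]

def cube_sets(cols):
    """Grouping sets of GROUP BY CUBE(cols): every subset of the list."""
    return [subset
            for k in range(len(cols), -1, -1)
            for subset in combinations(cols, k)]

abc = ("A", "B", "C")
print(rollup_sets(abc))   # n + 1 sets; the empty tuple is the grand total
print(cube_sets(abc))     # 2**n sets

# Order matters for ROLLUP: a different column order gives a different path.
print(rollup_sets(("color", "month", "state")))
print(rollup_sets(("state", "month", "color")))
```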

Page 41: On-Line Application Processing

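ROLLUP example

[Figure: the data cube lattice]

Total
State | Month | Color
State, Month | State, Color | Month, Color
State, Month, Color

    SELECT color, month, state, SUM(quantity)
    FROM sales
    GROUP BY ROLLUP(color, month, state)

This ROLLUP computes one path through the lattice: (Color, Month, State), then (Color, Month), then (Color), then Total.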