Decision Support Systems 1
From Transaction Processing to Support for Decision Making
CIS 671

Computerized Information Systems
• Used to “run the business”.
• OSU Examples
  – Personnel & Payroll (ARMS)
  – Course Offerings
  – Students, including course enrollments and grades
    • (estimated $30M to replace)
  – Inventory
• Transaction Processing

1st Generation DBMS
• Designed for Transaction Processing
  – Hierarchical – IBM – IMS
  – Network
• Management Information Systems
  – Added later
  – Mostly standard summary reports
    • Produced on a regular basis

Relational DBMS
• Codd – particularly designed for “ad hoc” queries
• First uses for Transaction Processing
• Transaction Data now available on-line
  – Use it to help Decision Making
  – Ad Hoc

Decision Support Systems (DSS)
• Use comprehensive view of all aspects of business.
  – Different business units
  – Historical data
  – Summary information
• Classes of analysis tools:
  – Complex “traditional” SQL queries
  – Many “group-by” and “aggregation” queries (On Line Analytical Processing)
  – Exploratory data analysis - Data Mining

Data Warehousing
• Properties
  – Consolidated data from many sources
  – Spanning long time periods
  – Augmented with summary information
• Size: several gigabytes to terabytes

Data Warehouse Creation
• Integrate schemas from different groups
  – Semantic mismatches
    • Different currencies
    • Different names for same attributes
    • Different structures for similar tables

Data Warehouse Creation, cont.
• Extract data from different operational databases and other external sources
  – Clean data - correct errors, fill in missing data
  – Transform data to match integrated schema
  – Load data into warehouse
  – Refresh data in a timely fashion
  – Purge very old data
  – Create metadata repository
    • May be so large that it is in a separate database

Data Warehouse - Provide Variety of Analytical Tools
  – Complex “traditional” SQL queries
  – OLAP query engine
  – Data mining algorithm
  – Information visualization tools
  – Statistical packages
  – Report generators

Data Mart
• Departmental subset of a data warehouse
• Top-down approach
  – Derive from the organization’s data warehouse
  – May be too hard to do all at once
• Bottom-up approach
  – Initially create departmental data marts
  – Integrate data marts into organizational data warehouse
  – If not done carefully, may be hard to integrate

OLTP vs. Data Warehouse DBs
(from Toby J. Teorey, Database Modeling & Design, Morgan Kaufmann, 1999, p. 212)
OLTP
• Transaction oriented
• Thousands of users
• Small (MB to several GB)
• Current data
• Normalized data (many tables, few columns per table)
• Continuous update
• Simple to complex queries
Data Warehouse
• Subject oriented
• Few users ( 100)
• Large (hundreds of GB to several TB)
• Historical data
• Denormalized data (few tables, many columns per table)
• Batch updates
• Usually very complex queries

Complex “traditional” SQL queries
• Relational DBMS optimized for decision support
  – in contrast to a DBMS optimized for transaction processing
• Example:
  – Teradata machine from NCR
On Line Analytical Processing (OLAP)
Multidimensional Databases (MDD)

Example from Finkelstein [Fink95]:
• Note that Branch, ProdID, Date → Sales, Returns
• Note the multidimensionality of the SALES_INFO table.

SALES_INFO
Branch  ProdID  Date    Sales      Returns
BOS     1       1/2/98  $1,000.00  4
NY      1       1/2/98  $1,222.00  2
CMH     2       1/3/98  $555.00    1
SF      2       1/3/98  $1,777.00  9

PROD_INFO
ProdID  Description   Category
1       Widget        I
2       Super Widget  II

BRANCH_INFO
Branch  Region
BOS     A
NY      A
CMH     B
SF      C

REGION_INFO
Region  Territory
A       East
B       East
C       West

Dimension Hierarchies (most general level first)
• LOCATION: Territory, Region, Branch
• TIME: Year, Quarter, Week / Month, Date
• PRODUCT: Category, ProdID

Possible queries:
1. How did product Widget sell in the last month, and how does this figure compare with sales over the last five years? How about by branch, region and territory?
2. Did this product sell better in different regions, and are there any regional trends?
3. Were there more returns of Widgets over the last year? Were these returns caused by defects? Were they manufactured in any particular plants?

Additional possible query:
4. Do commissions and pricing affect how salespersons sell the product? Do particular salespersons do a better job of selling the product?
Note that a "multidimensional" spreadsheet would be useful.
Codd called this type of problem On Line Analytical Processing (OLAP), in contrast to On Line Transaction Processing (OLTP).

Codd's rules for OLAP: [Codd93]
1. Multi-Dimensional Concept View
   The user should be able to see the data as being multidimensional insofar as it should be easy to 'pivot' or 'slice and dice'. (See later.)
2. Transparency
   The OLAP functionality should be provided behind the user's existing software without adversely affecting the functionality of the 'host'.
3. Accessibility
   OLAP should allow the user to access diverse data stores but see the data within a common 'schema' provided by the OLAP tool.

OLAP Rules, cont.
4. Consistent Reporting Performance
   There should not be significant degradation in performance with large numbers of dimensions or large quantities of data.
5. Client-Server Architecture
   Since much of the data is on mainframes, and the users work on PCs, the OLAP tool must be able to bring the two together!
6. Generic Dimensionality
   Data dimensions must all be treated equally. Functions available for one dimension must be available for others.

OLAP Rules, cont.
7. Dynamic Sparse Matrix Handling
   The OLAP tool should be able to work out for itself the most efficient way to store sparse matrix data.
8. Multi User Support
   This is self-evident.
9. Unrestricted Cross-Dimensional Operations
   e.g., individual office overheads are allocated according to total corporate overheads divided in proportion to individual office sales.

OLAP Rules, cont.
10. Intuitive Data Manipulation
    Navigation should be done by operations on individual cells rather than menus.
11. Flexible Reporting
    Row and column headings must be capable of more than one dimension each, and of displaying subsets of any dimension.
12. Unlimited Dimensions and Aggregation Levels
    At least 15 dimensions may be required, and within each there may be many hierarchical levels.

“Pivoting”: Cross Tabulation
Sales by Date and Region

         Region
Date     A        B      C        Total
1/2/98   $2,222   $0     $0       $2,222
1/3/98   $0       $555   $1,777   $2,332
Total    $2,222   $555   $1,777   $4,554
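
The cross tabulation above can be reproduced from the SALES_INFO and BRANCH_INFO rows. The following is a minimal sketch, not part of the original slides; the function name `cross_tab` and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

# SALES_INFO rows from the example: (branch, prod_id, date, sales, returns)
SALES_INFO = [
    ("BOS", 1, "1/2/98", 1000.00, 4),
    ("NY", 1, "1/2/98", 1222.00, 2),
    ("CMH", 2, "1/3/98", 555.00, 1),
    ("SF", 2, "1/3/98", 1777.00, 9),
]
# BRANCH_INFO: branch -> region
BRANCH_INFO = {"BOS": "A", "NY": "A", "CMH": "B", "SF": "C"}


def cross_tab(rows, branch_to_region):
    """Sum sales into a {date: {region: total}} table, including 'Total' margins."""
    table = defaultdict(lambda: defaultdict(float))
    for branch, _prod_id, date, sales, _returns in rows:
        region = branch_to_region[branch]
        # Each sale contributes to its own cell, both margins, and the grand total.
        for d in (date, "Total"):
            for r in (region, "Total"):
                table[d][r] += sales
    return table


pivot = cross_tab(SALES_INFO, BRANCH_INFO)
for date in ("1/2/98", "1/3/98", "Total"):
    print(date, [pivot[date][r] for r in ("A", "B", "C", "Total")])
```

Cells with no sales (for example region B on 1/2/98) come out as 0 because the nested `defaultdict` supplies a zero on first access.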

“Drill Down” (narrower category): Replace Region by Branch.

         Region
Date     A        B      C        Total
1/2/98   $2,222   $0     $0       $2,222
1/3/98   $0       $555   $1,777   $2,332
Total    $2,222   $555   $1,777   $4,554

“Rollup” (more general category): Replace Region by Territory.
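
Drill down and rollup amount to regrouping the same sales at a different level of the LOCATION hierarchy. A small sketch under that reading follows; the helper names `location_at` and `sales_by` are hypothetical, and the data comes from the Finkelstein example tables.

```python
# Hierarchy maps taken from BRANCH_INFO and REGION_INFO in the example.
BRANCH_TO_REGION = {"BOS": "A", "NY": "A", "CMH": "B", "SF": "C"}
REGION_TO_TERRITORY = {"A": "East", "B": "East", "C": "West"}

# (branch, sales) pairs from SALES_INFO.
SALES = [("BOS", 1000.00), ("NY", 1222.00), ("CMH", 555.00), ("SF", 1777.00)]


def location_at(branch, level):
    """Map a branch to its value at the requested hierarchy level."""
    if level == "Branch":
        return branch
    region = BRANCH_TO_REGION[branch]
    if level == "Region":
        return region
    if level == "Territory":
        return REGION_TO_TERRITORY[region]
    raise ValueError(f"unknown level: {level}")


def sales_by(level):
    """Total sales grouped at one level of the LOCATION hierarchy."""
    totals = {}
    for branch, sales in SALES:
        key = location_at(branch, level)
        totals[key] = totals.get(key, 0.0) + sales
    return totals


print(sales_by("Region"))     # starting point
print(sales_by("Branch"))     # drill down: narrower category
print(sales_by("Territory"))  # rollup: more general category
```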
OLAP Questions
1. Query language - how to say what's wanted.
2. Processing language - how to specify calculations: ratios, variances, . . . .
3. Data visualization - how to see the data.
4. Performance - time to process the query (5 second rule).
OLAP References
[Codd93] E. F. Codd, S. B. Codd, and C.T. Salley, "Providing OLAP to User Analysts: An IT Mandate," Codd & Date Inc., 1993.
[Fink95] Richard Finkelstein, "MDD: Database Reaches the Next Dimension," DATABASE Programming and Design, 8(4), April 1995.
Exploratory Data Analysis Data Mining
• Find interesting trends or patterns in large data sets.
• Statistics - Exploratory Data Analysis
• Artificial Intelligence - Knowledge Discovery and Machine Learning
• Much larger data sets

Mining for Association Rules
• Classic example
• Market basket analysis
  – Record each customer transaction at a grocery store.
  – Try to identify sets of items purchased together.

TransID  Item
111      coke
111      chips
111      dip
112      coke
112      chips
112      veggies
113      coke
113      beef
113      chicken
114      chips
114      beef
115      chips
115      chicken

Association Rule: {coke} ⇒ {chips}
People who buy coke usually buy chips.

Measures for Association Rule {LHS} ⇒ {RHS}
• Support: % of transactions containing this set of items (LHS and RHS together). (2/5=40%)
• Confidence: given all transactions containing LHS items, the % that also contain the RHS (2/3=67%)
Want both to be “reasonably” large.
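
The two measures can be checked directly against the transaction table. This is a minimal sketch; the names `BASKETS`, `support` and `confidence` are illustrative and not from the slides.

```python
# Transactions from the table above, one item set per TransID.
BASKETS = {
    111: {"coke", "chips", "dip"},
    112: {"coke", "chips", "veggies"},
    113: {"coke", "beef", "chicken"},
    114: {"chips", "beef"},
    115: {"chips", "chicken"},
}


def count_containing(items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for basket in BASKETS.values() if items <= basket)


def support(lhs, rhs):
    """Fraction of all transactions containing LHS and RHS together."""
    return count_containing(lhs | rhs) / len(BASKETS)


def confidence(lhs, rhs):
    """Among transactions containing LHS, the fraction that also contain RHS."""
    return count_containing(lhs | rhs) / count_containing(lhs)


print(support({"coke"}, {"chips"}))     # 2 of 5 transactions
print(confidence({"coke"}, {"chips"}))  # 2 of the 3 coke transactions
```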

On-Line Analytical Processing (OLAP), Part II
CIS 671
Elmasri & Navathe §26.1

Multi-dimensional View of Data
• Fact Table (also called cubes)
  – Dimension attributes
  – Dependent attributes (functions of the dimension attributes)
• Dimension Tables, potentially one for each dimension

OLAP Operations
• Roll-up – increase the level of aggregation
• Drill-down – decrease the level of aggregation
• Slice-and-dice – selection and projection, i.e., reduce dimensionality of the data
• Pivot – re-orient the dimensional view

Implementation Approaches
• Relational OLAP (ROLAP) Servers
  – Data stored in a relational system
  – SQL extended
    • To allow easy OLAP query expression
    • To provide efficient OLAP query execution.
• Multidimensional OLAP (MOLAP)
  – Systems directly store multidimensional data in special data structures
  – OLAP operations implemented directly on these data structures.
• Hybrid OLAP (HOLAP)
  – Combines ROLAP and MOLAP.
  – Detail records (largest volume) in relational database.
  – Aggregations in separate, but “connected”, MOLAP store.

Example of a Star Schema
Sales (Fact) table: Sales(OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice)
Dimension tables:
  Order(OrderNo, OrderDate)
  Customer(CustomerNo, CustomerName, CustomerAddress, City)
  Salesperson(SalespersonID, SalespersonName, City, Quota)
  City(CityName, State, Region)
  Date(DateKey, Date, Month, Year)
  Product(ProdNo, ProdName, ProdDescr, Category, CategoryDescr, UnitPrice, QOH)

Snowflake Schema
Sales (Fact) table: Sales(OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice)
Dimension tables:
  Order(OrderNo, OrderDate)
  Customer(CustomerNo, CustomerName, CustomerAddress, City)
  Salesperson(SalespersonID, SalespersonName, City, Quota)
  City(CityName, State, Region)
  Date(DateKey, Date, Month, Year)
  Product(ProdNo, ProdName, ProdDescr, Category, UnitPrice, QOH)
Normalized (snowflake) tables hanging off the dimensions:
  Month(Month, Year)
  Category(CategoryName, CategoryDescr)
  State(State, Region)
  The figure also shows Year and Region as further levels.
Data Cubes
• Precompute all possible aggregations.
• Required extra storage is tolerable.
• Little penalty to keep aggregate up-to-date if data does not change.
• Normally some aggregation of raw data is done before it is entered into the data cube.

Data Cube with Orders Accumulated
Sales table: Sales(SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
Dimension tables:
  Customer(CustomerNo, CustomerName, CustomerAddress, City)
  Salesperson(SalespersonID, SalespersonName, City, Quota)
  City(CityName, State)
  Date(DateKey, Date, Month)
  Product(ProdNo, ProdName, ProdDescr, Category, UnitPrice, QOH)
Normalized (snowflake) tables hanging off the dimensions:
  Month(Month, Year)
  Category(CategoryName, CategoryDescr)
  State(State, Region)
  The figure also shows Year and Region as further levels.
Note that the average for any aggregate can be calculated from TotalValue and Quantity.

Sample of Aggregates in the CUBE
Sales(SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
  22   11   100   2   ‘Columbus’   3     300

CUBE(Sales)(SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
  22   11   100   2   ‘Columbus’   3     300
  22   *    100   2   ‘Columbus’   6     2222
  22   *    *     2   ‘Columbus’   25    33000
  *    *    *     2   ‘Columbus’   75    90000
  *    *    *     *   ‘Columbus’   200   503444
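
One way to picture how such a relation could be produced is to let every fact row contribute to each of the 2^5 patterns obtained by replacing some of its dimension values with *. The sketch below follows that idea; it is an illustration rather than how any particular OLAP engine works, and the second and third fact rows are hypothetical data chosen so that one aggregate matches the sample above.

```python
from itertools import product

DIMS = ("SalespersonID", "CustomerNo", "ProdNo", "DateKey", "CityName")


def cube(rows):
    """Map each dimension pattern (with '*' wildcards) to (Quantity, TotalValue) sums."""
    out = {}
    for row in rows:
        dims, qty, value = row[:5], row[5], row[6]
        # Every subset of dimensions may be aggregated away (replaced by '*').
        for mask in product((False, True), repeat=len(dims)):
            key = tuple("*" if star else d for d, star in zip(dims, mask))
            old_qty, old_value = out.get(key, (0, 0))
            out[key] = (old_qty + qty, old_value + value)
    return out


SALES = [
    (22, 11, 100, 2, "Columbus", 3, 300),   # row shown on the slide
    (22, 12, 100, 2, "Columbus", 3, 1922),  # hypothetical
    (23, 11, 101, 3, "Columbus", 1, 50),    # hypothetical
]

cube_sales = cube(SALES)
print(cube_sales[(22, "*", 100, 2, "Columbus")])
print(cube_sales[("*", "*", "*", "*", "*")])
```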

How to answer query given the relation CUBE(Sales)
Choose tuples in CUBE(Sales) with the following properties:
1. Query specifies value v for attribute a ⇒ tuple t has v in its component for a.
2. Query groups by attribute a ⇒ tuple t has any non-* value in its component for a.
3. Query neither groups by attribute a nor specifies a value for a ⇒ tuple t has * value in its component for a.

How to answer query given the relation CUBE(Sales)
Cube(Sales)(SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
  22   11   100   2   ‘Columbus’   3     300
  22   *    100   2   ‘Columbus’   6     2222
  22   *    *     2   ‘Columbus’   25    33000
  *    *    *     2   ‘Columbus’   75    90000
  *    *    *     *   ‘Columbus’   200   503444

select CustomerNo, avg(Price)
from Sales
where SalespersonID = 22
group by CustomerNo

Matching tuples in Cube(Sales)(SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue):
  22   c   *   *   *   n   v
Result(c, v/n)
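
The three selection rules can be applied mechanically to a list of cube tuples. The following sketch assumes a small hand-written cube (the rows are hypothetical, not the sample above) and a hypothetical helper `answer`; it returns v/n for every matching tuple.

```python
DIMS = ("SalespersonID", "CustomerNo", "ProdNo", "DateKey", "CityName")

# Hypothetical cube tuples: five dimension components, then Quantity, TotalValue.
CUBE_SALES = [
    (22, 11, "*", "*", "*", 3, 300),
    (22, 12, "*", "*", "*", 3, 1922),
    (23, 11, "*", "*", "*", 1, 50),
    (22, "*", "*", "*", "*", 6, 2222),
    (22, 11, 100, 2, "Columbus", 3, 300),
    ("*", "*", "*", "*", "*", 7, 2272),
]


def answer(cube, fixed, group_by):
    """Average price per group, read off precomputed cube tuples.

    fixed:    {attribute: value} for attributes the query specifies (rule 1)
    group_by: attributes the query groups by (rule 2)
    All other attributes must be '*' in the chosen tuples (rule 3).
    """
    result = {}
    for t in cube:
        matches = True
        for i, attr in enumerate(DIMS):
            if attr in fixed:
                matches = t[i] == fixed[attr]
            elif attr in group_by:
                matches = t[i] != "*"
            else:
                matches = t[i] == "*"
            if not matches:
                break
        if matches:
            key = tuple(t[DIMS.index(a)] for a in group_by)
            quantity, total_value = t[5], t[6]
            result[key] = total_value / quantity
    return result


avg_price = answer(CUBE_SALES, {"SalespersonID": 22}, ["CustomerNo"])
print(avg_price)
```

Only the tuples of the form (22, c, *, *, *, n, v) are selected, so no aggregation over the base Sales table is needed at query time.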

Cube Implementation by Materialized Views
• Dimensions may have hierarchies.
  – Product, Category
  – City, State, Region

Example: Materialized Views
Cube(Sales)(SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
City(CityName, State, Region)
Date(DateKey, Date, Month)

insert into SalesV1
select SalespersonID, CustomerNo, Month, State,
       sum(Quantity) as Quantity, sum(TotalValue) as TotalValue
from Sales join City on Sales.CityName = City.CityName
           join Date on Sales.DateKey = Date.DateKey
group by SalespersonID, CustomerNo, Month, State;

insert into SalesV2
select SalespersonID, CustomerNo, Month, Region,
       sum(Quantity) as Quantity, sum(TotalValue) as TotalValue
from Sales join City on Sales.CityName = City.CityName
           join Date on Sales.DateKey = Date.DateKey
group by SalespersonID, CustomerNo, Month, Region;

Example: Query 1
select SalespersonID, sum(TotalValue)
from Sales
group by SalespersonID;

Answer by
select SalespersonID, sum(TotalValue)
from SalesV1
group by SalespersonID;

or by
select SalespersonID, sum(TotalValue)
from SalesV2
group by SalespersonID;

Example: Query 2
select SalespersonID, State, sum(TotalValue)
from Sales
group by SalespersonID, State;

Answer only by
select SalespersonID, State, sum(TotalValue)
from SalesV1
group by SalespersonID, State;

Example: Query 3
select SalespersonID, State, Date, sum(TotalValue)
from Sales
group by SalespersonID, State, Date;

Cannot be answered by either SalesV1 or SalesV2. Thus must use Sales itself.
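
The reasoning in Queries 1 to 3 can be phrased as a simple test: a view can answer a sum query if every grouping attribute of the query is either in the view or derivable from a finer attribute in the view. The sketch below encodes that test; the `ROLLS_UP_TO` table and the function names are assumptions based on the City and Date hierarchies in these slides, and the base Sales relation is treated as exposing Date and CityName through its dimension tables.

```python
# For each attribute, the attributes that can be derived from it (itself included).
# Assumed hierarchies: Date -> Month -> Year and CityName -> State -> Region.
ROLLS_UP_TO = {
    "Date": {"Date", "Month", "Year"},
    "Month": {"Month", "Year"},
    "CityName": {"CityName", "State", "Region"},
    "State": {"State", "Region"},
}

# Grouping attributes of the base relation and the two materialized views.
VIEWS = {
    "Sales": ["SalespersonID", "CustomerNo", "ProdNo", "Date", "CityName"],
    "SalesV1": ["SalespersonID", "CustomerNo", "Month", "State"],
    "SalesV2": ["SalespersonID", "CustomerNo", "Month", "Region"],
}


def can_answer(view_attrs, query_attrs):
    """True if every query grouping attribute is derivable from the view."""
    available = set()
    for attr in view_attrs:
        available |= ROLLS_UP_TO.get(attr, {attr})
    return set(query_attrs) <= available


def usable_views(query_attrs):
    return sorted(name for name, attrs in VIEWS.items() if can_answer(attrs, query_attrs))


print(usable_views(["SalespersonID"]))                   # Query 1
print(usable_views(["SalespersonID", "State"]))          # Query 2
print(usable_views(["SalespersonID", "State", "Date"]))  # Query 3
```

This works for sum because partial sums can be summed again; an aggregate such as a median would not in general be recoverable from a coarser view.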
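
Lattice of Views
• Time: Days, then Weeks or Months, then Quarters (from Months), then Years, then All
• Location: City, State, Region, All
(Each list runs from the most detailed level to the most general.)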

Lattice of Materialized Views and Queries
[Figure: lattice relating Sales, SalesV1, SalesV2 and the queries Q1, Q2, Q3]

OLAP Example
(Garcia-Molina, Ullman & Widom, Database System Implementation, Prentice Hall, 2000)
Automobile Sales Company: analyze sales of cars
Fact Table:
  Sales(serialNo, date, dealer, price)
Dimension Tables:
  Autos(serialNo, model, color)
  Dealers(name, city, state)
Time Dimension Table, probably not stored:
  Days(day, week, month, year), e.g., (5, 27, 7, 2000)

Assume a particular car model, say ‘Gobi’, is not selling as well as anticipated.
How to analyze?
Maybe it’s the color. Slice for ‘Gobi’. Dice for color.

select color, sum(price)
from Sales natural join Autos
where model = 'Gobi'
group by color;

Doesn’t show anything interesting.

Gobi analysis, continuing
What about time? Drill down for month.

select color, month, sum(price)
from Sales natural join Autos
     join Days on date = day
where model = 'Gobi'
group by color, month;

Suppose we discover red Gobis have not sold well recently.

Gobi analysis, continuing
Are red Gobis selling poorly for all dealers or just some?
Drill down for dealer.

select dealer, month, sum(price)
from Sales natural join Autos
     join Days on date = day
where model = 'Gobi' and color = 'red'
group by dealer, month;

Discover there are too few sales to show anything interesting.

Gobi analysis, continuing
Rollup time from month to year and slice for last two years.

select dealer, year, sum(price)
from Sales natural join Autos
     join Days on date = day
where model = 'Gobi' and color = 'red' and (year = 1999 or year = 2000)
group by dealer, year;

Does show variation. Now understand the problem better.
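
To try the final rollup query, the schema can be loaded into an in-memory SQLite database. This is a sketch with made-up rows; it assumes `day` is a unique integer identifier for a calendar day (so that `date = day` joins each sale to exactly one Days row), and it adds an `order by` only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Sales(serialNo integer, date integer, dealer text, price real);
    create table Autos(serialNo integer, model text, color text);
    create table Days(day integer, week integer, month integer, year integer);
""")

# Hypothetical sample data, for illustration only.
conn.executemany("insert into Autos values (?, ?, ?)", [
    (1, "Gobi", "red"), (2, "Gobi", "red"), (3, "Gobi", "blue"),
    (4, "Gobi", "red"), (5, "Gobi", "red"),
])
conn.executemany("insert into Days values (?, ?, ?, ?)", [
    (1, 1, 1, 1999), (2, 27, 7, 2000), (3, 50, 12, 1998),
])
conn.executemany("insert into Sales values (?, ?, ?, ?)", [
    (1, 1, "Smith", 20000), (2, 2, "Smith", 18000), (3, 2, "Jones", 21000),
    (4, 3, "Jones", 19000), (5, 2, "Jones", 17500),
])

ROLLUP_QUERY = """
    select dealer, year, sum(price)
    from Sales natural join Autos
         join Days on date = day
    where model = 'Gobi' and color = 'red' and (year = 1999 or year = 2000)
    group by dealer, year
    order by dealer, year
"""
rows = conn.execute(ROLLUP_QUERY).fetchall()
for dealer, year, total in rows:
    print(dealer, year, total)
```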

Administration
• Lab assignments and HWs posted on the web.
• Clarifications/Questions?
• Please use appropriate online submit command
• Teams of 2 allowed but make contribution of each team member explicit, especially in the lab assignment.
• Extra Credit assignment in lab.
• Bring questions to class on Thursday

Color codes for the sample data cube. Tuple representation: (time in quarters, product, country, Tsales)
• time, product, country are dimension attributes; Tsales is total sales
• White squares (basic fact table): (q, p, c, sales)
• Green squares: total annual sales grouped by product and country. (*, p, c, Tsales)
• Dark green squares: total annual sales grouped by product. (*, p, *, Tsales)
• Orange squares: total sales grouped by quarter and country. (q, *, c, Tsales)
• Dark orange squares: total sales grouped by quarter. (q, *, *, Tsales)
• Grey: total annual sales grouped by country. (*, *, c, Tsales)
• Other pair (quarter and product) not shown (need to pivot). (q, p, *, Tsales)
• Dark blue (all sales): (*, *, *, Tsales)
A Sample Data Cube
(from Data Mining: Concepts and Techniques, slide 27, February 22, 2003)
[Figure: cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum) and Country (U.S.A., Canada, Mexico, sum); one highlighted cell is the total annual sales of TV in U.S.A.]
• Size of white cube = Q × P × C
• Size of colored cube = (Q+1) × (P+1) × (C+1). Why? Think of * as another category along each dimension.
• Size of colored cube with hierarchies: even larger!
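
The size formulas can be checked with the dimensions shown in the figure (4 quarters, 3 products, 3 countries). A minimal sketch, with illustrative function names:

```python
from math import prod


def base_cells(sizes):
    """Cells in the basic fact cube: Q x P x C."""
    return prod(sizes)


def cube_cells(sizes):
    """Cells once each dimension gains a '*' (sum) category: (Q+1) x (P+1) x (C+1)."""
    return prod(size + 1 for size in sizes)


sizes = (4, 3, 3)  # quarters, products, countries in the sample cube
print(base_cells(sizes), cube_cells(sizes))
```

With these sizes the aggregate cells (the difference between the two counts) already outnumber the base cells.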

Cube Operation
(from Data Mining: Concepts and Techniques, slide 41)
• Cube definition and computation:
  define cube sales[item, city, year]: sum(sales_in_dollars)
  compute cube sales
• Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)
  SELECT item, city, year, SUM (amount)
  FROM SALES
  CUBE BY item, city, year
• Need to compute the following Group-Bys:
  (date, product, customer), (date, product), (date, customer), (product, customer), (date), (product), (customer), ()
• Lattice of group-bys for sales[item, city, year]:
  ()
  (item)  (city)  (year)
  (city, item)  (city, year)  (item, year)
  (city, item, year)
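
The group-bys that CUBE BY implies are simply all subsets of the listed dimensions, 2^n of them for n dimensions. A short sketch that enumerates them, from the full grouping down to the empty one (the function name `group_bys` is illustrative):

```python
from itertools import combinations


def group_bys(dims):
    """All subsets of the cube dimensions, largest first, ending with ()."""
    return [
        combo
        for k in range(len(dims), -1, -1)
        for combo in combinations(dims, k)
    ]


for grouping in group_bys(("item", "city", "year")):
    print(grouping)
```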
Aggregation causes Database Explosion
in Large Multi-dimensional Applications as the Number of
Dimensions Increases
Based on Nigel Pendse, “Database Explosion”,
www.olapreport.com/DatabaseExplosion.htm

Factors not causing data explosion
• Poor handling of data sparsity.
  – No more than factor of 4 vs. factors of 10s or 100s
• Type of database technology.
  – Although optimized storage technology will be significantly better.
• Lack of data compression.
  – Compression is helpful, but explosion still occurs.
• Software errors
  – Again, a different problem.

Multi-dimensional Database (MDB) can save significant space
• Keys, indexes & dimensional structures.
  – Not required or take far less space.
• Sparsity better suppressed.
• Data compressed.
• Example:
  – 6-dimensional (including measures) banking cube
  – 13 million row fact table
  – Relational fact table incl. indexes, but not aggregates: 5188 MB
  – MOLAP cube including aggregations: 336 MB
  – Well under 10% the space.
  – Much faster query processing.

Why is there a data explosion even without sparsity?
• Take two dimensional example
• n: data from original source.
• m: data aggregations precalculated.
• p: on-the-fly results, not stored.
[Figure: nested squares with sides n, n+m and n+m+p, i.e., areas $n^2$, $(n+m)^2$ and $(n+m+p)^2$]
Simplifying to n = m = p: $1n^2$, $4n^2$, $9n^2$
In 3 dimensions this becomes $1n^3$, $8n^3$, $27n^3$
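
The arithmetic behind these multiples can be written out as a small function. In this sketch, n, m and p are member counts per dimension as defined above, and the function name and dictionary keys are illustrative.

```python
def cell_counts(n, m, p, dims):
    """Cells of source data, source plus stored aggregates, and everything viewable."""
    return {
        "source": n ** dims,
        "stored": (n + m) ** dims,
        "viewable": (n + m + p) ** dims,
    }


two_d = cell_counts(10, 10, 10, dims=2)
three_d = cell_counts(10, 10, 10, dims=3)
print({k: v // two_d["source"] for k, v in two_d.items()})      # multiples of n^2
print({k: v // three_d["source"] for k, v in three_d.items()})  # multiples of n^3
```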

When Data is Sparse it’s much worse.
• One-dimensional data.
• Simple hierarchy. Black - actual data, red - nulls.
• Detailed level: 8 of 25 or 32%.
• Aggregated levels: 5 of 6 or 83%.
• Growth factor: 1.625 (13 cells based on 8 input cells)

Two dimensions: The problem gets worse
[Figure: grid of detail data bordered by aggregated data]
• Potential input cells: 25*25 = 625
• Potential aggregated cells: 6*6 + 6*25 + 6*25 = 336
• More than 1 derived cell for every 2 possible input cells.
• In 6 dimensions, could have 2 or 3 derived cells per 1 input cell.
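
The potential cell counts generalize to any number of dimensions: with 25 detail members and 6 aggregate members per dimension, every cell that has at least one aggregate coordinate is derived. A sketch of that count (the function name is illustrative, and it assumes every dimension has the same shape as the one-dimensional example):

```python
def potential_cells(detail, aggregated, dims):
    """Return (input cells, derived cells) for `dims` identical dimensions."""
    input_cells = detail ** dims
    all_cells = (detail + aggregated) ** dims
    return input_cells, all_cells - input_cells


for dims in (1, 2, 6):
    inputs, derived = potential_cells(25, 6, dims)
    print(dims, inputs, derived, round(derived / inputs, 2))
```

For two dimensions this reproduces 625 input and 336 aggregated cells, and for six dimensions the ratio falls between 2 and 3 derived cells per input cell.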

What about higher dimensions?
• One percent density, 6 of 625 input cells.
• Yields 29 computed cells.
• I.e., 35 total cells, only 6 input.
• Growth factor: 5.83.
• Growth factor per dimension: sqrt(5.83) = 2.4.
  – Called compound growth factor (CGF).
• CGF is typically in the range 1.5 to 2.5.
• CGF increases as sparsity increases.
• With large dimensions, will often be more consolidation.
  – (Many thousands of products ⇒ more levels of groupings.)
• With a CGF of 2.0, an extra dimension, with no increase in input data, will double the size of the fully computed database.
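
The compound growth factor is the per-dimension root of the overall growth factor. A small sketch of the calculation, using the figures quoted above (function names are illustrative):

```python
def compound_growth_factor(input_cells, total_cells, dims):
    """Per-dimension growth: the dims-th root of total cells / input cells."""
    return (total_cells / input_cells) ** (1 / dims)


def fully_computed_size(input_cells, cgf, dims):
    """Estimated size of the fully computed database for a given CGF."""
    return input_cells * cgf ** dims


print(compound_growth_factor(6, 35, dims=2))   # two-dimensional example
print(compound_growth_factor(8, 13, dims=1))   # one-dimensional example
print(fully_computed_size(1000, 2.0, dims=6) / fully_computed_size(1000, 2.0, dims=5))
```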

So what is the problem?
• Disk space increases.
• Can software handle this much data?
• Time to load and update database increases.
  – Could take days to load the database.

What to do?
• Avoid fully pre-calculating any multi-dimensional object with more than 5 sparse dimensions.
• Reduce sparsity of individual data objects:
  – Use good application design.

What to pre-calculate?
• Data that is slow to calculate at run-time because it depends on many other cells or complex formulae.
• Data that is frequently viewed.
• Data that is the basis of many other calculations.
• Note: If too much is precalculated, performance may decrease because cache will not include as much useful data.