cs 345: topics in data warehousing thursday, october 14, 2004
TRANSCRIPT
![Page 1: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/1.jpg)
CS 345:Topics in Data Warehousing
Thursday, October 14, 2004
![Page 2: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/2.jpg)
Review of Tuesday’s Class• Customer Relationship Management (CRM)• Dimension-focused queries
– Drill-across– Conformed dimensions
• Customer dimension– Behavioral attributes– Auxiliary tables
• Techniques for very large dimensions– Outriggers– Mini-dimensions
![Page 3: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/3.jpg)
Outline of Today’s Class
• Bridge tables– Hierarchies– Multi-Valued Dimensions
• Extraction-Transformation-Load– Data staging area vs. data warehouse– Assigning surrogate keys– Detecting changed rows– Detecting duplicate dimension rows
![Page 4: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/4.jpg)
More Outriggers / Mini-Dims• Lots of information about some customers, little info about others
– A common scenario• Example: web site browsing behavior
– Web User dimension (= Customer dimension)– Unregistered users
• User identity tracked over time via cookies• Limited information available
– First active date, Latest active date, Behavioral attributes– Possibly ZIP code through IP lookup
– Registered users• Lots of data provided by user during registration
– Many more unregistered users than registered users– Most attribute values are unknown for unregistered users
• Split registered user attributes into a separate table– Either an outrigger or a mini-dimension– For unregistered users, point to special “Unregistered” row
![Page 5: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/5.jpg)
Handling Hierarchies• Hierarchical relationships among dimension attributes are common• There are various ways to handle hierarchies
– Store all levels of hierarchy in denormalized dimension table• The preferred solution in almost all cases!
– Create “snowflake” schema with hierarchy captured in separate outrigger table
• Only recommended for huge dimension tables• Storage savings have negligible impact in most cases
• What about variable-depth hierarchies?– Examples:
• Corporate organization chart• Parts composed of subparts
– Previous two solutions assumed fixed-depth– Creating recursive foreign key to parent row is a possibility
• Employee dimension has “boss” attribute which is FK to Employee• The CEO has NULL value for boss• This approach is not recommended• Cannot be queried effectively using SQL
– Alternative approach: bridge table
![Page 6: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/6.jpg)
Bridge Tables• Customer dimension has one row
for each customer entity at any level of the hierarchy
• Separate bridge table has schema:
– Parent customer key– Subsidiary customer key– Depth of subsidiary– Bottom flag– Top flag
• One row in bridge table for every (ancestor, descendant) pair
– Customer counts as its own Depth-0 ancestor
– 16 rows for the hierarchy at right• Fact table can join:
– Directly to customer dimension– Through bridge table to customer
dimension
Customer 1
Customer 2
Customer 3
Customer 5
Customer 6
Customer 4
Customer 7
Fact Bridge Customer
cust_id
parent_id
child_id
cust_id
![Page 7: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/7.jpg)
Bridge Table Example
parent_id child_id depth top_flag bottom_flag
1 1 0 Y N
1 2 1 Y N
2 2 0 N N
1 3 1 Y Y
3 3 0 N Y
1 4 1 Y N
4 4 0 N N
1 5 2 Y Y
2 5 1 N Y
5 5 0 N Y
1 6 2 Y Y
2 6 1 N Y
… … … … …
![Page 8: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/8.jpg)
Using Bridge Tables in Queries• Two join directions
– Navigate up the hierarchy• Fact joins to subsidiary customer key• Dimension joins to parent customer key
– Navigate down the hierarchy• Fact joins to parent customer key• Dimension joins to subsidiary customer key
• Safe uses of the bridge table:– Filter on customer dimension restricts query to a single customer
• Use bridge table to combine data about that customer’s subsidiaries or parents
– Filter on bridge table restricts query to a single level• Require Top Flag = Y• Require Depth = 1
– For immediate parent / child organizations• Require (Depth = 1 OR (Depth < 1 AND Top Flag = Y))
– Generalizes the previous example to properly treat top-level customers
• Other uses of the bridge table risk over-counting– Bridge table is many-to-many between fact and dimension
![Page 9: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/9.jpg)
Restricting to One Customer
parent_id child_id depth top_flag bottom_flag
1 1 0 Y N
1 2 1 Y N
2 2 0 N N
1 3 1 Y Y
3 3 0 N Y
1 4 1 Y N
4 4 0 N N
1 5 2 Y Y
2 5 1 N Y
5 5 0 N Y
1 6 2 Y Y
2 6 1 N Y
… … … … …
![Page 10: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/10.jpg)
Restricting to One Depth
parent_id child_id depth top_flag bottom_flag
1 1 0 Y N
1 2 1 Y N
2 2 0 N N
1 3 1 Y Y
3 3 0 N Y
1 4 1 Y N
4 4 0 N N
1 5 2 Y Y
2 5 1 N Y
5 5 0 N Y
1 6 2 Y Y
2 6 1 N Y
… … … … …
![Page 11: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/11.jpg)
Multi-Valued Dimensions
• Occasionally a dimension takes on a variable number of multiple values– Example: Bank accounts may be owned by one, two,
or even more customers (individual vs. joint accounts)
• Can be modeled using a bridge table• Bank transaction fact table
– Grain: one row per transaction– Dimensions: Date, Branch, TransType, Account,
Customer– Including Customer dimension would violate the grain
![Page 12: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/12.jpg)
Multi-Valued Dimensions
account_id account_id
customer_id
account_id
customer_id
weight
FactTable
BridgeTable
AccountDimension
CustomerDimension
Account-relatedattributes
Customer-relatedattributes
• Weights for each account sum to 1• Allows for proper allocation of facts when using Customer dimension
![Page 13: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/13.jpg)
Weighted Report vs. Impact Report
• Two formulations for customer queries• Weighted report
– Multiply all facts by weight before aggregating– SUM(DollarAmt * weight)– Subtotals and totals are meaningful
• Impact report– Don’t use the weight column– SUM(DollarAmt)– Some facts are double-counted in totals– Each customer is fully credited for his/her activity– Most useful when grouping by customer
![Page 14: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/14.jpg)
Loading the Data Warehouse
Source Systems Data Staging Area Data Warehouse
(OLTP)
Data is periodically extracted
Data is cleansed and transformed
Users query the data warehouse
![Page 15: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/15.jpg)
Staging Area vs. Warehouse
• Data warehouse– Cleansed, transformed data– User-friendly logical design– Optimized physical design
• Indexes, Pre-computed aggregates
• Staging area– Intermediate representations of data– “Work area” for data transformations
• Same server or different?• Separate staging server and warehouse server
– Run extraction in parallel with queries• Staging area and warehouse both part of same database
– Less copying of data is required
![Page 16: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/16.jpg)
Alternating Server Approach
Warehouse
Staging
Warehouse
Staging
LoadNewData
Backup
Make a Copy
Staging
Warehouse
Swap Roles
Load the Copy
![Page 17: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/17.jpg)
Surrogate Key Assignment• Maintain natural key → surrogate key mapping
– Separate mapping for each dimension table– Can be stored in a relational database table
• One or more columns for natural key• One column for surrogate key
• Need a separate mapping table for each data source– Unless data sources already use unified natural key scheme
• Handling multiple dimension rows that preserve history– 1st approach: Mapping table contains surrogate key for most
current dimension row– 2nd approach: Mapping table lists all surrogate keys that were
ever used for each natural key• Add additional columns to mapping table:
– Begin_date, End_date, Is_current_flag• “Late-arriving” fact rows can use historically correct key• Necessary for hybrid slowly changing dimension schemes
![Page 18: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/18.jpg)
Detecting Changed Rows
• Some source systems make things easy– All changes timestamped → nothing to do!– Usually the case for fact tables
• Except for Accumulating Snapshot facts– For each source system, record latest timestamp
covered during previous extraction cycle• Some source systems just hold snapshot
– Need to detect new vs. changed vs. unchanged rows– New vs. old rows: Use surrogate key mapping table– Detecting changed vs. unchanged rows
• Approach 1: Use a hash function– Faster but less reliable
• Approach 2: Column-by-column comparison– Slower but more reliable
![Page 19: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/19.jpg)
Handling Changed Rows
• Using a hash function– Compute a small summary of the data row– Store previous hash value in mapping table– Compare with hash value of current attribute values– If they’re equal, assume no change
• Hash table collisions are possible– Cyclic redundancy checksum (CRC)
• Commonly used hash function family• No collisions under local changes and byte reorderings
• Determine which attributes have changed– Requires column-by-column comparison– Store untransformed attribute values in mapping table– Choose slowly changing dimension approach based on changed
attributes
![Page 20: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/20.jpg)
Dimension Loading Workflow
Natural key in mapping table?
Has rowchanged?
Which SCDtype?
Do nothing
Update row indimension table/Mark aggregates
Update row inmapping table
Insert row in dimension table
Insert row in mapping table
Insert row in dimension table
Update row inmapping table
Yes
No
Yes
NoType 1
Type 2
![Page 21: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/21.jpg)
Duplicates from Multiple Sources
• Information about the same logical entity found in multiple source systems
• Combine info into single dimension row• Problems:
– Determine which rows reference same entity• Sometimes it’s hard to tell!• Referred to as the merge/purge problem• Active area of research
– Resolve conflicts between source systems• Matching records with different values for the same field• Approach #1: Believe the “more reliable” system• Approach #2: Include both values as separate attributes
![Page 22: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/22.jpg)
Merge/Purge
• How can we determine that these are the same person?
• Fuzzy matching based on textual similarity• Transformation rules• Comparison to known good source
– NCOA: National Change Of Address database• Related application: Householding
– Can we determine when two individuals/accounts belong to the same household?
– Send one mailing instead of two
FName LName Address City Zip
B Babcock 3135 Campus Dr #112 San Mateo 94403
Brian Bobcock 3135 Compass Drive Apt 112 San Mateo 94403-3205
![Page 23: CS 345: Topics in Data Warehousing Thursday, October 14, 2004](https://reader036.vdocuments.net/reader036/viewer/2022062304/56649f2c5503460f94c474ab/html5/thumbnails/23.jpg)
Next Week: Query Processing
• We’re done with logical database design• Next topic: how can the database answer
queries efficiently?• No textbook from here on
– Optional readings for each topic will be posted on the course web page
• Mostly research papers• Some readings from books on reserve in Math/CS library
• Tuesday’s topic: Query processing basics– Will be review if you’ve taken CS 245– Haven’t taken CS 245 → may want to read DSCB
Chapter 15.1-15.5, 15.8