1 ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis) Data Staging Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential Chair of Business

Post on 15-Jan-2016

1

ACCTG 6910
Building Enterprise & Business Intelligence Systems (e.bis)

Data Staging

Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business

2

The Business Dimensional Lifecycle

[Figure: lifecycle diagram with the following boxes: Project Planning; Business Requirement Definition; Technical Architecture Design; Product Selection & Installation; Dimensional Modeling; Physical Design; Data Staging Design & Development; End-User Application Specification; End-User Application Development; Deployment; Maintenance and Growth; Project Management]

3

Data Staging

[Figure: legacy systems (DB2, Access, Excel) feed the data staging area, which loads the data warehouse (Oracle)]

4

Data Staging

• Extraction
• Data Cleansing
• Data Integration
• Transformation
• Transportation (Loading)
• Maintenance

5

Extraction

• Extract source data from legacy systems and place it in a staging area.

• To reduce the impact on the performance of legacy systems, source data is extracted without any cleansing, integration and transformation operations.
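A minimal sketch of extraction without cleansing, assuming in-memory SQLite databases stand in for the legacy system and the staging area. The table `orders`, its columns, and the `stg_` prefix are hypothetical names chosen for illustration.

```python
import sqlite3

def extract_to_staging(source, staging, table):
    """Copy every row of `table` from source to staging exactly as stored."""
    cursor = source.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    # The staging table mirrors the source columns; no constraints, no cleanup.
    staging.execute(f"CREATE TABLE IF NOT EXISTS stg_{table} ({col_list})")
    staging.executemany(
        f"INSERT INTO stg_{table} ({col_list}) VALUES ({placeholders})", rows
    )
    staging.commit()
    return len(rows)

# Hypothetical legacy data; the dirty values ("m", NULL) are deliberately left alone.
legacy = sqlite3.connect(":memory:")
legacy.execute("CREATE TABLE orders (order_id INTEGER, gender TEXT, amount REAL)")
legacy.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "M", 10.0), (2, "m", None), (3, "F", 7.5)],
)

staging = sqlite3.connect(":memory:")
copied = extract_to_staging(legacy, staging, "orders")
staged_rows = staging.execute("SELECT * FROM stg_orders ORDER BY order_id").fetchall()
```

The only work done against the source is a single read; cleansing, integration and transformation happen later against the staged copy.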

6

Extraction

• A variety of file formats exist in legacy systems
– Relational database: DB2, Oracle, SQL Server, Informix, Access …
– Flat file: Excel file, text file

• Commercial data extraction tools are very helpful in data extraction.
– Ex: Oracle Data Mart Builder

7

Data Preparation (Cleansing)

It’s all about data quality!!!

8

Outline

• Measures for Data Quality
• Causes for data errors
• Common types of data errors
• Common error checks
• Correcting missing values
• Timing for error checks and corrections
• Steps of data preparation

9

Measures for Data Quality

• Correctness/Accuracy - w.r.t. the real data
• Consistency/Uniqueness - data values, references, measures and interpretations
• Completeness - scope of data & values
• Relevancy - w.r.t. the requirements
• Current data - relevant to the requirements

10

Causes for Data Errors

• Data entry errors
• Correct data not available at the time of data entry
• Different users at the same time, or the same users over time
– Inconsistent or incorrect use of “codes”
– Inconsistent or incorrect interpretation of “fields”
• Transaction processing errors
• System and recovery errors
• Data extract/transformation errors

11

Common Data Errors

• Missing (null) values
• Incorrect use of default values (e.g., zero)
• Data domain integrity violation (e.g., 0/1)
• Data value (dependency) integrity violation (e.g., if mm=02 then DD<30)
• Data referential integrity violation (e.g., a customer’s order record cannot exist unless the customer record already exists)

12

Common Data Errors, Cont’d

• Data retention integrity violation (e.g., old inventory snapshots should not be stored)

• Data Derivation/Transformation/Aggregation Integrity Violation (e.g., profit ≠ sales - costs)

• Inconsistent data values of the same data (e.g., M versus m for male)

• Inconsistent use of the same data value (e.g., DM for both Data Mining and Data Marts)
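As an illustration of repairing inconsistent values of the same data, here is a small sketch that maps variant gender codes to one canonical code. The lookup table `GENDER_CODES` and the function name are hypothetical; a real project would take the accepted variants from its own source systems.

```python
# Hypothetical lookup of known variants (lower-cased) to the canonical code.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def standardize_gender(raw):
    """Return the canonical code, or None when the value is missing or unrecognized."""
    if raw is None:
        return None
    return GENDER_CODES.get(raw.strip().lower())

cleaned = [standardize_gender(v) for v in ["M", "m", " Female ", "x", None]]
```

Values that come back as None are not guessed at; they can be routed to manual review.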

13

Error Checks

• Domain value validation
• Value dependency validation
• Referential integrity validation
• Identify missing-value or default-value records
• Identify outliers
• Cross-footing - Check aggregates and derivations across different levels and against common sense
• Eyeballs!
• Process validation
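Several of these checks are easy to automate. The sketch below shows one possible form of the domain, value dependency, referential integrity and cross-footing checks; the function names, the record layout (`customer_id`, `order_id`) and the tolerance value are assumptions made for the example.

```python
from datetime import date

def check_domain(value, allowed):
    """Domain value validation: the value must be one of the allowed codes."""
    return value in allowed

def check_date_dependency(year, month, day):
    """Value dependency validation: the day must be valid for the month and year."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False

def check_referential(orders, customers):
    """Referential integrity validation: return orders whose customer does not exist."""
    known = {c["customer_id"] for c in customers}
    return [o for o in orders if o["customer_id"] not in known]

def check_cross_footing(detail_amounts, reported_total, tolerance=0.01):
    """Cross-footing: the detail rows must add up to the reported aggregate."""
    return abs(sum(detail_amounts) - reported_total) <= tolerance

# Hypothetical sample data.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 9},  # no such customer
]
orphans = check_referential(orders, customers)
```

Each check only flags a problem; deciding how to correct the flagged records is a separate step.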

14

Data Cleaning: Missing Values

1. Exclude the record
2. Exclude the attribute/field
3. Replace with a global constant
4. Replace with the attribute mean
5. Replace with the most probable value
6. Apply 3 – 5 by class/segment of records
7. Manual correction
8. Application-specific algorithm

1 – 6 are less practical for OLAP-bound data
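Replacement by the attribute mean, either over all records or per class/segment, can be sketched as follows. The field names `income` and `segment` and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def fill_with_mean(records, field):
    """Replace missing values of `field` with the mean of the present values."""
    fill = mean(r[field] for r in records if r[field] is not None)
    return [dict(r, **{field: fill if r[field] is None else r[field]}) for r in records]

def fill_with_class_mean(records, field, class_field):
    """Replace missing values with the mean of the record's own class/segment."""
    groups = defaultdict(list)
    for r in records:
        if r[field] is not None:
            groups[r[class_field]].append(r[field])
    class_means = {k: mean(v) for k, v in groups.items()}
    filled = []
    for r in records:
        value = r[field]
        if value is None:
            # Stays None when the whole class has no observed value.
            value = class_means.get(r[class_field])
        filled.append(dict(r, **{field: value}))
    return filled

records = [
    {"segment": "retail", "income": 40.0},
    {"segment": "retail", "income": None},
    {"segment": "retail", "income": 60.0},
    {"segment": "corporate", "income": 100.0},
    {"segment": "corporate", "income": None},
]
overall = fill_with_mean(records, "income")
by_class = fill_with_class_mean(records, "income", "segment")
```

Both functions return new records rather than modifying the input, so the original staged values remain available for audit.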

15

Timing for Error Checking

• During Data Staging
• During Data Loading
• Others
– Before data extraction (data entries, transaction processing, recovery, audits, etc.)
– After data loading

16

Steps of Data Preparation

• Identify data sources
• Extract and analyze source data
• Standardize data
• Correct and complete data
• Match and consolidate data
• Analyze data defect types
• Transform and enhance data into target
• Calculate derivations and summary data
• Audit and control data extract, transformation and loading

17

Data Integration

• Data from different data sources with different formats need to be integrated into one data warehouse
– Ex: 3 customer tables in the sales department, the marketing department and an acquired company

Customer (cid, cname, city …)
Customer (customerid, customername, city …)
Customer (custid, custname, cname, …)

18

Data Integration

• Same attribute with different names: cid, customerid, custid

• Different attributes with the same name:
– cname -> customer name
– cname -> city name

• Same attribute with different formats
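One simple way to resolve these naming conflicts is an explicit per-source column mapping. The sketch below uses the three Customer tables from the example; the source labels (`sales`, `marketing`, `acquired`) and the warehouse column names (`customer_id`, `customer_name`, `city`) are assumptions for illustration.

```python
# Hypothetical mapping: source column name -> warehouse column name, per source.
COLUMN_MAPS = {
    "sales": {"cid": "customer_id", "cname": "customer_name", "city": "city"},
    "marketing": {
        "customerid": "customer_id",
        "customername": "customer_name",
        "city": "city",
    },
    # In the acquired company's table, cname means city name.
    "acquired": {"custid": "customer_id", "custname": "customer_name", "cname": "city"},
}

def to_warehouse_row(source, row):
    """Rename the columns of one source row to the warehouse schema."""
    mapping = COLUMN_MAPS[source]
    return {mapping[col]: value for col, value in row.items() if col in mapping}

sales_row = to_warehouse_row("sales", {"cid": 1, "cname": "Ann", "city": "Ogden"})
acquired_row = to_warehouse_row(
    "acquired", {"custid": 7, "custname": "Bo", "cname": "Provo"}
)
```

Because the mapping is keyed by source, the same column name (`cname`) can safely mean different things in different systems.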

19

Data Integration

• How to integrate

– Get the schemas of all data sources

– Get the schema of the data warehouse

– Integrate source schemas with help from commercial tools and domain experts

20

Transformation

• Prepare data for loading into the data warehouse
– Change the data format
– Create derived attributes and tables
– Aggregate
– Create warehouse keys
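The four transformation tasks can be sketched together in a few lines. This is only an illustration: the row layout, the MM/DD/YYYY source date format, and the `KeyGenerator` class are assumptions, not part of the course material.

```python
from collections import defaultdict
from datetime import datetime

class KeyGenerator:
    """Assign a stable integer warehouse key to each (source system, source id)."""

    def __init__(self):
        self._keys = {}

    def key_for(self, source_system, source_id):
        natural = (source_system, source_id)
        if natural not in self._keys:
            self._keys[natural] = len(self._keys) + 1
        return self._keys[natural]

def transform(row, keys):
    # Change the data format: MM/DD/YYYY text becomes an ISO date string.
    order_date = datetime.strptime(row["order_date"], "%m/%d/%Y").date().isoformat()
    return {
        "customer_key": keys.key_for(row["source"], row["customer_id"]),
        "order_date": order_date,
        "sales": row["sales"],
        "costs": row["costs"],
        "profit": row["sales"] - row["costs"],  # derived attribute
    }

def aggregate_profit_by_month(rows):
    """Aggregate: total profit per YYYY-MM."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["order_date"][:7]] += r["profit"]
    return dict(totals)

staged = [
    {"source": "sales", "customer_id": "C1", "order_date": "01/15/2016",
     "sales": 100.0, "costs": 60.0},
    {"source": "sales", "customer_id": "C1", "order_date": "01/20/2016",
     "sales": 50.0, "costs": 20.0},
    {"source": "acquired", "customer_id": "C1", "order_date": "02/03/2016",
     "sales": 80.0, "costs": 30.0},
]
keys = KeyGenerator()
facts = [transform(r, keys) for r in staged]
monthly = aggregate_profit_by_month(facts)
```

Keying on the pair (source system, source id) keeps customers from different systems apart even when their source ids collide, as `C1` does here.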

21

Transportation

• Use bulk load tools, such as Oracle SQL*Loader, instead of SQL commands

• Create indexes

22

Maintenance

• Maintenance frequency: daily, weekly, monthly

• Identify changed records and new records in legacy systems
– Create timestamps for changes and new records in legacy systems
– Compare data between legacy systems and DW

• Load changes and new records into DW
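Both ways of identifying changed and new records can be sketched briefly. The field names (`id`, `city`, `modified_at`) and the sample rows are hypothetical, and the sketch assumes the legacy system maintains the timestamp reliably.

```python
from datetime import datetime

def changes_since(source_rows, last_load_time):
    """Timestamp approach: rows created or modified after the last load."""
    return [r for r in source_rows if r["modified_at"] > last_load_time]

def diff_against_warehouse(source_rows, warehouse_rows, key, fields):
    """Comparison approach: return (new rows, changed rows) relative to the DW."""
    current = {r[key]: r for r in warehouse_rows}
    new, changed = [], []
    for r in source_rows:
        existing = current.get(r[key])
        if existing is None:
            new.append(r)
        elif any(r[f] != existing[f] for f in fields):
            changed.append(r)
    return new, changed

source = [
    {"id": 1, "city": "Ogden", "modified_at": datetime(2016, 1, 10)},
    {"id": 2, "city": "Provo", "modified_at": datetime(2016, 1, 14)},
    {"id": 3, "city": "Logan", "modified_at": datetime(2016, 1, 15)},
]
warehouse = [{"id": 1, "city": "Ogden"}, {"id": 2, "city": "Orem"}]

recent = changes_since(source, datetime(2016, 1, 12))
new_rows, changed_rows = diff_against_warehouse(source, warehouse, "id", ["city"])
```

The timestamp approach reads only recent rows, while the comparison approach needs no support from the legacy system but must scan both sides.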